EXERCISES
LATTICE
xyplot(Ozone ~ Wind | as.factor(Month), data = airquality, layout=c(5,1))
That’s correct!
Since Month is a named column of the airquality dataframe we had to tell R to treat it as a
factor. To see how this affects the plot, rerun the xyplot command you just ran, but use Ozone ~
Wind | Month instead of Ozone ~ Wind | as.factor(Month) as the first argument.
xyplot(Ozone ~ Wind | Month, data = airquality, layout=c(5,1))
Your dedication is inspiring!
Not as informative, right? The word Month in each panel really doesn’t tell you much if it
doesn’t identify which month it’s plotting. Notice that the actual data is the same between the
two plots, though.
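If you want the panel strips to show month names rather than bare numbers, one approach (our own sketch, not part of the lesson) is to build a labeled factor first:

```r
library(lattice)

# Month runs from 5 (May) to 9 (September) in airquality; giving the
# factor explicit labels makes each panel strip self-explanatory.
aq <- transform(airquality, Month = factor(Month, labels = month.name[5:9]))

xyplot(Ozone ~ Wind | Month, data = aq, layout = c(5, 1))
```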
Lattice functions behave differently from base graphics functions in one critical way. Recall
that base graphics functions plot data directly to the graphics device (e.g., screen, or file
such as a PDF file). In contrast, lattice graphics functions return an object of class trellis.
The print methods for lattice functions actually do the work of plotting the data on the
graphics device. They return “plot objects” that can be stored (but it’s usually better to just
save the code and data). On the command line, trellis objects are auto-printed so that it
appears the function is plotting the data.
To see this, create a variable p which is assigned the output of this simple call to xyplot,
xyplot(Ozone~Wind,data=airquality).
p <- xyplot(Ozone~Wind,data=airquality)
Excellent work!
Nothing plotted, right? But the object p is around.
Type p or print(p) now to see it.
print(p)
Excellent work!
Like magic, it appears. Now run the R command names with p as its argument.
names(p)
 [1] "formula"           "as.table"          "aspect.fill"       "legend"
 [5] "panel"             "page"              "layout"            "skip"
 [9] "strip"             "strip.left"        "xscale.components" "yscale.components"
[13] "axis"              "xlab"              "ylab"              "xlab.default"
[17] "ylab.default"      "xlab.top"          "ylab.right"        "main"
[21] "sub"               "x.between"         "y.between"         "par.settings"
[25] "plot.args"         "lattice.options"   "par.strip.text"    "index.cond"
[29] "perm.cond"         "condlevels"        "call"              "x.scales"
[33] "y.scales"          "panel.args.common" "panel.args"        "packet.sizes"
[37] "x.limits"          "y.limits"          "x.used.at"         "y.used.at"
[41] "x.num.limit"       "y.num.limit"       "aspect.ratio"      "prepanel.default"
[45] "prepanel"
You’re the best!
We see that the trellis object p has 45 named properties, the first of which is “formula”, which
isn’t too surprising. A lot of these properties are probably NULL in value. We’ve done some
behind-the-scenes work for you and created two vectors. The first, mynames, is a character
vector of the names in p. The second is a logical vector, myfull, which has TRUE values for the
non-NULL entries of p. Run mynames[myfull] to see which entries of p are not NULL.
mynames[myfull]
 [1] "formula"           "as.table"          "aspect.fill"       "panel"
 [5] "skip"              "strip"             "strip.left"        "xscale.components"
 [9] "yscale.components" "axis"              "xlab"              "ylab"
[13] "xlab.default"      "ylab.default"      "x.between"         "y.between"
[17] "index.cond"        "perm.cond"         "condlevels"        "call"
[21] "x.scales"          "y.scales"          "panel.args.common" "panel.args"
[25] "packet.sizes"      "x.limits"          "y.limits"          "aspect.ratio"
[29] "prepanel.default"
That’s the answer I was looking for.
Wow! 29 non-NULL values for one little plot. Note that a lot of them are like the ones we saw in
the base plotting system. Let’s look at the values of some of them. Type p[["formula"]] now.
p[["formula"]]
Ozone ~ Wind
You got it!
Not surprising, is it? It’s a familiar formula. Now look at p’s x.limits. Remember the double
square brackets and quotes.
p[["x.limits"]]
[1]  0.37 22.03
That’s the answer I was looking for.
They match the plot, right? The x values are indeed between .37 and 22.03.
Again, not surprising. Before we wrap up, let’s talk about lattice’s panel functions which
control what happens inside each panel of the plot. The ease of making multi-panel plots makes
lattice very appealing. The lattice package comes with default panel functions, but you can
customize what happens in each panel.
Panel functions receive the x and y coordinates of the data points in their panel (along with
any optional arguments). To see this, we’ve created some data for you - two 100-long vectors, x
and y. For its first 50 values y is a function of x, for the last 50 values, y is random. We’ve
also defined a 100-long factor vector f which distinguishes between the first and last 50
elements of the two vectors. Run the R command table with f as its argument.
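The transcript breaks off here, but the idea can be sketched with a custom panel function. Everything below (the seed, the coefficient, and the reconstructed x, y, and f) is our own stand-in for the lesson's data:

```r
library(lattice)

# Reconstruct data like the lesson describes: 100 points where the
# first 50 y-values depend on x and the last 50 are pure noise.
set.seed(1)
x <- rnorm(100)
y <- c(2 * x[1:50] + rnorm(50, sd = 0.2), rnorm(50))
f <- factor(rep(c("dependent", "random"), each = 50))

# A panel function receives just the x and y values belonging to its
# panel; here we plot the points and overlay a least-squares line.
xyplot(y ~ x | f, panel = function(x, y, ...) {
  panel.xyplot(x, y, ...)
  panel.lmline(x, y, col = "red")
})
```

In the panel with dependent data the fitted line tracks the points closely; in the random panel it is nearly flat.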
COLORS
So we’ll first discuss some functions that the grDevices package offers. The function colors()
lists the names of 657 predefined colors you can use in any plotting function. These names are
returned as strings. Run the R command sample with colors() as its first argument and 10 as its
second to give you an idea of the choices you have.
sample(colors(), 10)
 [1] "lightyellow2"   "ivory2"         "palevioletred1" "grey100"        "gray76"
 [6] "coral"          "mediumpurple2"  "rosybrown4"     "grey94"         "chartreuse"
That’s the answer I was looking for.
We see a lot of variety in the colors, some of which are names followed by numbers indicating
that there are multiple forms of that particular color.
So you’re free to use any of these 600+ colors listed by the colors function. However, two
additional functions from grDevices, colorRamp and colorRampPalette, give you more options. Both
of these take color names as arguments and use them as “palettes”, that is, these argument
colors are blended in different proportions to form new colors.
The first, colorRamp, takes a palette of colors (the arguments) and returns a function that
takes values between 0 and 1 as arguments. The 0 and 1 correspond to the extremes of the color
palette. Arguments between 0 and 1 return blends of these extremes.
Let’s see what this means. Assign to the variable pal the output of a call to colorRamp with the
single argument, c("red", "blue").
pal <- colorRamp(c("red", "blue"))
Excellent job!
We don’t see any output, but R has created the function pal which we can call with a single
argument between 0 and 1. Call pal now with the argument 0.
pal(0)
     [,1] [,2] [,3]
[1,]  255    0    0
Nice work!
We see a 1 by 3 array with 255 as the first entry and 0 in the other 2. This 3-long vector
corresponds to red, green, blue (RGB) color encoding commonly used in televisions and monitors.
In R, 24 bits are used to represent colors. Think of these 24 bits as 3 sets of 8 bits, each of
which represents an intensity for one of the colors red, green, and blue.
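You can verify this encoding yourself with grDevices' rgb(), which packs three 8-bit intensities into a hex string (the second example color is an arbitrary choice of ours):

```r
# Each channel gets 8 bits (0-255); rgb() packs the three intensities
# into the familiar "#RRGGBB" hexadecimal string.
rgb(255, 0, 0, maxColorValue = 255)    # full red -> "#FF0000"
rgb(0, 128, 255, maxColorValue = 255)  # a green/blue mix -> "#0080FF"
```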
The 255 returned from the pal(0) call corresponds to the largest possible number represented
with 8 bits, so the vector (255,0,0) contains only red (no green or blue), and moreover, it’s
the highest possible value of red.
Given that you created pal with the palette containing “red” and “blue”, what color do you think
will be represented by the vector that pal(1) returns? Recall that pal will only take arguments
between 0 and 1, so 1 is the largest argument you can pass it.
1: yellow 2: green 3: blue 4: red
Selection: 3
Keep up the great work!
Check your answer now by calling pal with the argument 1.
pal(1)
     [,1] [,2] [,3]
[1,]    0    0  255
All that practice is paying off!
|======================= | 26%
You see the vector (0,0,255) which represents the highest intensity of blue. What vector do you
think the call pal(.5) will return?
1: (255,255,255) 2: (255,0,255) 3: (127.5,0,127.5) 4: (0,255,0)
Selection: 2
That’s not exactly what I’m looking for. Try again.
The correct answer should be halfway between (255,0,0) and (0,0,255). Which is the only choice
that is the average (mean) of these two?
1: (127.5,0,127.5) 2: (255,0,255) 3: (255,255,255) 4: (0,255,0)
Selection: 4
Nice try, but that’s not exactly what I was hoping for. Try again.
The correct answer should be halfway between (255,0,0) and (0,0,255). Which is the only choice
that is the average (mean) of these two?
1: (127.5,0,127.5) 2: (0,255,0) 3: (255,255,255) 4: (255,0,255)
Selection: 1
All that hard work is paying off!
The function pal can take more than one argument. It returns one 3-long (or 4-long, but more
about this later) vector for each argument. To see this in action, call pal with the argument
seq(0,1,len=6).
pal(seq(0, 1, len = 6))
     [,1] [,2] [,3]
[1,]  255    0    0
[2,]  204    0   51
[3,]  153    0  102
[4,]  102    0  153
[5,]   51    0  204
[6,]    0    0  255
Your dedication is inspiring!
Six vectors (each of length 3) are returned. The i-th vector is identical to the output that
would be returned by the call pal(i/5) for i = 0, ..., 5. We see that each row (for i = 2, ..., 6)
differs from the (i-1)-st row in the following way: its red entry is 51 = 255/5 points lower and
its blue entry is 51 points higher.
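Since colorRamp interpolates linearly by default, you can check this arithmetic directly; the sketch below compares pal's output against a hand-computed blend:

```r
pal <- colorRamp(c("red", "blue"))
t <- seq(0, 1, len = 6)

# Linear interpolation: red falls as 255*(1-t), blue rises as 255*t,
# and green stays at 0 throughout.
manual <- cbind(red = 255 * (1 - t), green = 0, blue = 255 * t)
all.equal(unname(pal(t)), unname(manual))  # TRUE (up to rounding)
```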
So pal creates colors using the palette we specified when we called colorRamp. In this example
none of pal’s outputs will ever contain green since it wasn’t in our initial palette.
We’ll turn now to colorRampPalette, a function similar to colorRamp. It also takes a palette of
colors and returns a function. This function, however, takes integer arguments (instead of
numbers between 0 and 1) and returns a vector of colors each of which is a blend of colors of
the original palette.
The argument you pass to the returned function specifies the number of colors you want returned.
Each element of the returned vector is a 24 bit number, represented as 6 hexadecimal characters,
which range from 0 to F. This set of 6 hex characters represents the intensities of red, green,
and blue, 2 characters for each color.
To see this better, assign to the variable p1 the output of a call to colorRampPalette with the
single argument, c("red", "blue"). We’ll compare it to our experiments using colorRamp.
p1 <- colorRampPalette(c("red", "blue"))
That’s a job well done!
Now call p1 with the argument 2.
p1(2)
[1] "#FF0000" "#0000FF"
Keep up the great work!
We see a 2-long vector is returned. The first entry FF0000 represents red. The FF is hexadecimal
for 255, the same value returned by our call pal(0). The second entry 0000FF represents blue,
also with intensity 255.
Now call p1 with the argument 6. Let’s see if we get the same result as we did when we called
pal with the argument seq(0,1,len=6).
p1(6)
[1] "#FF0000" "#CC0033" "#990066" "#650099" "#3200CC" "#0000FF"
Nice work!
Now we get the 6-long vector (FF0000, CC0033, 990066, 650099, 3200CC, 0000FF). We see the two
ends (FF0000 and 0000FF) are consistent with the colors red and blue. How about CC0033? Type
0xcc or 0xCC at the command line to see the decimal equivalent of this hex number. You must
include the 0 before the x to specify that you’re entering a hexadecimal number.
0xcc
[1] 204
You nailed it! Good job!
So 0xCC equals 204 and we can easily convert hex 33 to decimal, as in 0x33=3*16+3=51. These were
exactly the numbers we got in the second row returned from our call to pal(seq(0,1,len=6)). We
see that 4 of the 6 numbers agree with our earlier call to pal. Two of the 6 differ slightly.
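strtoi() will do these hex-to-decimal conversions for you:

```r
# Base-16 conversions behind the "#CC0033" entry:
strtoi("CC", base = 16L)  # 204
strtoi("33", base = 16L)  # 51
strtoi("FF", base = 16L)  # 255
```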
We can also form palettes using colors other than red, green and blue. Form a palette, p2, by
calling colorRampPalette with the colors “red” and “yellow”. Remember to concatenate them into a
single argument.
p1 <- colorRampPalette(c("red", "yellow"))
Not exactly. Give it another go. Or, type info() for more options.
Type p2 <- colorRampPalette(c("red", "yellow")) at the command prompt.
p2 <- colorRampPalette(c("red", "yellow"))
You got it right!
Now call p2 with the argument 2. This will show us the two extremes of the blends of colors
we’ll get.
p2(2)
[1] "#FF0000" "#FFFF00"
All that hard work is paying off!
Not surprisingly the first color we see is FF0000, which we know represents red. The second
color returned, FFFF00, must represent yellow, a combination of full intensity red and full
intensity green. This makes sense, since yellow falls between red and green on the color wheel.
(The lesson displays an image here, borrowed from lucaskrech.com.)
Let’s now call p2 with the argument 10. This will show us how the two extremes, red and yellow,
are blended together.
p2(10)
 [1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100" "#FF8D00" "#FFAA00" "#FFC600" "#FFE200"
[10] "#FFFF00"
Excellent job!
So we see the 10-long vector. For each element, the red component is fixed at FF, and the green
component grows from 00 (at the first element) to FF (at the last).
This is all fine and dandy but you’re probably wondering when you can see how all these colors
show up in a display. We copied some code from the R documentation pages (color.scale if you’re
interested) and created a function for you, showMe. This takes as an argument, a color vector,
which as you know, is precisely what calls to p1 and p2 return to you. Call showMe now with
p1(20).
showMe(p1(20))
Great job!
We see the interpolated palette here. Low values in the lower left corner are red and as you
move to the upper right, the colors move toward blue. Now call showMe with p2(20) as its
argument.
showMe(p2(20))
All that hard work is paying off!
Here we see a similar display, the colors moving from red to yellow, the base colors of our p2
palette. For fun, see what p2(2) looks like using showMe.
showMe(p2(2))
That’s a job well done!
A much more basic pattern, simple but elegant.
We mentioned before that colorRamp (and colorRampPalette) could return a 3 or 4 long vector of
colors. We saw 3-long vectors returned indicating red, green, and blue intensities. What would
the 4th entry be?
Type p1 at the command prompt.
p1
function (n)
{
    x <- ramp(seq.int(0, 1, length.out = n))
    if (ncol(x) == 4L)
        rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255)
    else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255)
}
So the fourth argument is alpha which can be a logical, i.e., either TRUE or FALSE, or a
numerical value. Create the function p3 now by calling colorRampPalette with the colors blue and
green (remember to concatenate them into a single argument) and setting the alpha argument to
.5.
p3 <- colorRampPalette(c("blue", "green"), .5)
That’s not exactly what I’m looking for. Try again. Or, type info() for more options.
Type p3 <- colorRampPalette(c("blue", "green"), alpha = .5) at the command prompt.
p3 <- colorRampPalette(c("blue", "green"), alpha = .5)
All that hard work is paying off!
Now call p3 with the argument 5.
p3(5)
[1] "#0000FFFF" "#003FBFFF" "#007F7FFF" "#00BF3FFF" "#00FF00FF"
You are doing so well!
We see that in the 5-long vector that the call returned, each element has 32 bits (8
hexadecimal characters): 4 groups of 8 bits each. The last 8 bits represent the value of alpha.
Since it was NOT ZERO in the call to colorRampPalette, it gets the maximum FF value. (The same
result would happen if alpha had been set to TRUE.) When it was 0 or FALSE (as in previous calls
to colorRampPalette) it was given the value 00 and wasn’t shown. The leftmost 24 bits of each
element are the same RGB encoding we previously saw.
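rgb() exposes the same trailing alpha byte; here half opacity (127 out of 255, an illustrative value of ours) shows up as the final "7F":

```r
# The alpha argument adds a fourth byte to the hex encoding.
rgb(0, 0, 255, alpha = 127, maxColorValue = 255)  # "#0000FF7F"
rgb(0, 0, 255, maxColorValue = 255)               # "#0000FF" (no alpha byte)
```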
So what is alpha? Alpha represents an opacity level, that is, how transparent should the colors
be. We can add color transparency with the alpha parameter to calls to rgb. We haven’t seen any
examples of this yet, but we will now.
We generated 1000 random normal pairs for you in the variables x and y. We’ll plot them in a
scatterplot by calling plot with 4 arguments. The variables x and y are the first 2. The third
is the plotting character argument pch. Set this equal to 19 (filled circles). The final argument
is col which should be set equal to a call to rgb. Give rgb 3 arguments, 0, .5, and .5.
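The transcript stops before showing this call; a sketch of what it would look like follows (the seed and the added alpha value are our own choices, since the lesson's x and y aren't reproduced here):

```r
# Simulate the 1000 random normal pairs the lesson describes.
set.seed(42)
x <- rnorm(1000)
y <- rnorm(1000)

# pch = 19 gives filled circles; adding an alpha value to rgb() makes
# overlapping points darker, revealing the density of the cloud.
plot(x, y, pch = 19, col = rgb(0, 0.5, 0.5, 0.3))
```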
GGPLOT
qplot(displ, hwy, data = mpg)
Great job!
A nice scatterplot done simply, right? All the labels are provided. The first argument is shown
along the x-axis and the second along the y-axis. The negative trend (increasing displacement
and lower gas mileage) is pretty clear. Now suppose we want to do the same plot but this time
use different colors to distinguish between the 3 factors (subsets) of different types of drive
(drv) in the data (front-wheel, rear-wheel, and 4-wheel). Again, qplot makes this very easy.
We’ll just add what ggplot2 calls an aesthetic, a fourth argument, color, and set it equal to
drv. Try this now. (Use the up arrow key to save some typing.)
qplot(displ, hwy, data = mpg, color = drv)
Pretty cool, right? See the legend to the right which qplot helpfully supplied? The colors were
automatically assigned by qplot so the legend decodes the colors for you. Notice that qplot
automatically used dots or points to indicate the data. These points are geoms (geometric
objects). We could have used a different aesthetic, for instance shape instead of color, to
distinguish between the drive types.
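For instance, swapping color for shape (our variation, not shown in the lesson) looks like:

```r
library(ggplot2)

# Same scatterplot, but the drive types are distinguished by point
# shape rather than by color.
qplot(displ, hwy, data = mpg, shape = drv)
```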
Now let’s add a second geom to the default points. How about some smoothing function to produce
trend lines, one for each color? Just add a fifth argument, geom, and using the R function c(),
set it equal to the concatenation of the two strings “point” and “smooth”. The first refers to
the data points and second to the trend lines we want plotted. Try this now.
qplot(displ, hwy, data = mpg, color = drv, geom = c("point", "smooth"))
geom_smooth() using method = 'loess'
Your dedication is inspiring!
Notice the gray areas surrounding each trend line. These indicate the 95% confidence intervals
for the lines.
qplot(y=hwy, data = mpg, color = drv)
You nailed it! Good job!
What’s this plot showing? We see the x-axis ranges from 0 to 250 and we remember that we had 234
data points in our set, so we can infer that each point in the plot represents one of the hwy
values (indicated by the y-axis). We’ve created the vector myhigh for you which contains the hwy
data from the mpg dataset. Look at myhigh now.
The all-purpose qplot can also create box and whisker plots. Call qplot now with 4 arguments.
First specify the variable by which you’ll split the data, in this case drv, then specify the
variable which you want to examine, in this case hwy. The third argument is data (set equal to
mpg), and the fourth, the geom, set equal to the string “boxplot”.
qplot(drv, hwy, data = mpg, geom = "boxplot")
We see 3 boxes, one for each drive. Now to impress you, call qplot with 5 arguments. The first 4
are just as you used previously, (drv, hwy, data set equal to mpg, and geom set equal to the
string “boxplot”). Now add a fifth argument, color, equal to manufacturer.
qplot(drv, hwy, data = mpg, geom = "boxplot", color = manufacturer)
It’s a little squished but we just wanted to illustrate qplot’s capabilities. Notice that there
are still 3 regions of the plot (determined by the factor drv). Each is subdivided into several
boxes depicting different manufacturers.
Now, on to histograms. These display frequency counts for a single variable. Let’s start with an
easy one. Call qplot with 3 arguments. First specify the variable for which you want the
frequency count, in this case hwy, then specify the data (set equal to mpg), and finally, the
aesthetic, fill, set equal to drv. Instead of a plain old histogram, this will again use colors
to distinguish the 3 different drive factors.
qplot(hwy, data = mpg, fill = drv)
stat_bin() using bins = 30. Pick better value with binwidth.
Excellent work!
See how qplot consistently uses the colors. Red (if 4-wheel drv is in the bin) is at the bottom
of the bin, then green on top of it (if present), followed by blue (rear wheel drv). The color
lets us see right away that 4-wheel drive vehicles in this dataset don’t have gas mileages
exceeding 30 miles per gallon.
It’s cool that qplot can do this so easily, but some people may find this multi-color histogram
hard to interpret. Instead of using colors to distinguish between the drive factors let’s use
facets or panels. (That’s what lattice called them.) This just means we’ll split the data into 3
subsets (according to drive) and make 3 smaller individual plots of each subset in one plot (and
with one call to qplot).
Remember that with base plot we had to do each subplot individually. The lattice system made
plotting conditioning plots easier. Let’s see how easy it is with qplot.
We’ll do two plots, a scatterplot and then a histogram, each with 3 facets. For the scatterplot,
call qplot with 4 arguments. The first two are displ and hwy and the third is the argument data
set equal to mpg. The fourth is the argument facets which will be set equal to the expression
. ~ drv, which is ggplot2’s shorthand for a formula with the rows variable to the left of the ~
and the columns variable to the right of it. Here the . indicates a single row, and drv implies
3 columns, since there are 3 distinct drive factors. Try this now.
qplot(displ, hwy, data = mpg, facets = . ~ drv)
The result is a 1 by 3 array of plots. Note how each is labeled at the top with the factor label
(4,f, or r). This shows us more detailed information than the histogram. We see the relationship
between displacement and highway mileage for each of the 3 drive factors.
Now we’ll do a histogram, again calling qplot with 4 arguments. This time, since we need only
one variable for a histogram, the first is hwy and the second is the argument data set equal to
mpg. The third is the argument facets which we’ll set equal to the expression drv ~ . . This
will give us a different arrangement of the facets. The fourth argument is binwidth. Set this
equal to 2. Try this now.
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
A “grammar” of graphics means that ggplot2 contains building blocks with which you can create
your own graphical objects. What are these basic components of ggplot2 plots? There are 7 of
them.
Obviously, there’s a DATA FRAME which contains the data you’re trying to plot. Then the
AESTHETIC MAPPINGS determine how data are mapped to color, size, etc. The GEOMS (geometric
objects) are what you see in the plot (points, lines, shapes) and FACETS are the panels used in
conditional plots. You’ve used these or seen them used in the first ggplot2 (qplot) lesson.
There are 3 more. STATS are statistical transformations such as binning, quantiles, and
smoothing which ggplot2 applies to the data. SCALES show what coding an aesthetic map uses (for
example, male = red, female = blue). Finally, the plots are depicted on a COORDINATE SYSTEM.
When you use qplot these were taken care of for you.
We’ll keep using the mpg data that comes with the ggplot2 package. Recall the versatility of
qplot. Just as a refresher, call qplot now with 5 arguments. The first 3 deal with data - displ,
hwy, and data=mpg. The fourth is geom set equal to the concatenation of the two strings, “point”
and “smooth”. The fifth is facets set equal to the formula .~drv. Try this now.
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), facets = . ~ drv)
We see a 3 facet plot, one for each drive type (4, f, and r). Now we’ll see how ggplot works.
We’ll build up a similar plot using the basic components of the package. We’ll do this in a
series of steps.
First we’ll create a variable g by assigning to it the output of a call to ggplot with 2
arguments. The first is mpg (our dataset) and the second will tell ggplot what we want to plot,
in this case, displ and hwy. These are what we want our aesthetics to represent so we enclose
these as two arguments to the function aes. Try this now.
g <- ggplot(mpg, aes(displ, hwy))
Nice work!
Notice that nothing happened? As in the lattice system, ggplot created a graphical object which
we assigned to the variable g.
Run the R command summary with g as its argument to see what g contains.
summary(g)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
  class [234x11]
mapping:  x = displ, y = hwy
faceting: compute_layout: function
  draw_back: function
  draw_front: function
  draw_labels: function
  draw_panels: function
  finish_data: function
  init_scales: function
  map: function
  map_data: function
  params: list
  render_back: function
  render_front: function
  render_panels: function
  setup_data: function
  setup_params: function
  shrink: TRUE
  train: function
  train_positions: function
  train_scales: function
  vars: function
  super:
You nailed it! Good job!
So g contains the mpg data, a data frame with 234 observations of 11 named variables. It also
contains a mapping, x (displ) and y (hwy), which you specified, and no faceting.
Note that if you tried to print g with the expressions g or print(g) you’d get an error! Even
though it’s a great package, ggplot doesn’t know how to display the data yet since you didn’t
specify how you wanted to see it. Now type g+geom_point() and see what happens.
g+geom_point()
Nice work!
By calling the function geom_point you added a layer. By not assigning the expression to a
variable you displayed a plot. Notice that you didn’t have to pass any arguments to the function
geom_point. That’s because the object g has all the data stored in it. (Remember you saw that
when you ran summary on g before.) Now use the expression you just typed (g + geom_point()) and
add to it another layer, a call to geom_smooth(). Notice the red message R gives you.
g + geom_point() + geom_smooth()
geom_smooth() using method = 'loess'
All that hard work is paying off!
The gray shadow around the blue line is the confidence band. See how wide it is at the right?
Let’s try a different smoothing function. Use the up arrow to recover the expression you just
typed, and instead of calling geom_smooth with no arguments, call it with the argument method
set equal to the string “lm”.
g + geom_point() + geom_smooth(method = "lm")
By changing the smoothing function to “lm” (linear model) ggplot2 generated a regression line
through the data. Now recall the expression you just used and add to it another call, this time
to the function facet_grid. Use the formula . ~ drv as its argument. Note that this is the same
type of formula used in the calls to qplot.
g + geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ drv)
Notice how each panel is labeled with the appropriate factor. All the data associated with
4-wheel drive cars is in the leftmost panel, front-wheel drive data is shown in the middle
panel, and rear-wheel drive data in the rightmost. Notice that this is similar to the plot you
created at the start of the lesson using qplot. (We used a different smoothing function than
previously.)
So far you’ve just used the default labels that ggplot provides. You can add your own annotation
using functions such as xlab(), ylab(), and ggtitle(). In addition, the function labs() is more
general and can be used to label either or both axes as well as provide a title. Now recall the
expression you just typed and add a call to the function ggtitle with the argument “Swirl
Rules!”.
g + geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ drv) + ggtitle("Swirl Rules!")
Now that you’ve seen the basics we’ll talk about customizing. Each of the “geom” functions
(e.g., geom_point and geom_smooth) has options to modify it. Also, the function theme() can be
used to
modify aspects of the entire plot, e.g. the position of the legend. Two standard appearance
themes are included in ggplot. These are theme_gray() which is the default theme (gray
background with white grid lines) and theme_bw() which is a plainer (black and white) color
scheme.
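A quick sketch of both ideas (the legend position below is just an illustrative choice of ours):

```r
library(ggplot2)

g <- ggplot(mpg, aes(displ, hwy))

# theme_bw() swaps the default gray background for a plain
# black-and-white scheme; theme() then fine-tunes details such as
# where the legend sits.
g + geom_point(aes(color = drv)) + theme_bw() +
  theme(legend.position = "bottom")
```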
Let’s practice modifying aesthetics now. We’ll use the graphic object g that we already filled
with mpg data and add a call to the function geom_point, but this time we’ll give geom_point 3
arguments. Set the argument color equal to “pink”, the argument size to 4, and the argument
alpha to 1/2. Notice that all the arguments are set equal to constants.
g + geom_point(color = "pink", size = 4, alpha = 1/2)
Notice the different shades of pink? That’s the result of the alpha aesthetic which you set to
1/2. This aesthetic tells ggplot how transparent the points should be. Darker circles indicate
values hit by multiple data points.
Now we’ll modify the aesthetics so that color indicates which drv type each point represents.
Again, use g and add to it a call to the function geom_point with 3 arguments. The first is size
set equal to 4, the second is alpha equal to 1/2. The third is a call to the function aes with
the argument color set equal to drv. Note that you MUST use the function aes since the color of
the points is data dependent and not a constant as it was in the previous example.
g+geom_point(aes(color = drv), size = 4, alpha = 1/2)
Your dedication is inspiring!
|============================================ | 50%
Notice the helpful legend on the right decoding the relationship between color and drv.
…
|============================================== | 52%
Now we’ll practice modifying labels. Again, we’ll use g and add to it calls to 3 functions.
First, add a call to geom_point with an argument making the color dependent on the drv type (as
we did in the previous example). Second, add a call to the function labs with the argument title
set equal to “Swirl Rules!”. Finally, add a call to labs with 2 arguments, one setting x equal
to “Displacement” and the other setting y equal to “Hwy Mileage”.
g + geom_point(aes(color = drv)) + labs(title="Swirl Rules!") + labs(x="Displacement",y="Hwy Mileage")
Note that you could have combined the two calls to the function labs in the previous example.
Now we’ll practice customizing the geom_smooth calls. Use g and add to it a call to geom_point
setting the color to drv type (remember to use the call to the aes function), size set to 2 and
alpha to 1/2. Then add a call to geom_smooth with 4 arguments. Set size equal to 4, linetype to
3, method to “lm”, and se to FALSE.
g + geom_point(aes(color = drv), size = 2, alpha = 1/2) + geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
What did these arguments do? The method specified a linear regression (note the negative slope
indicating that the bigger the displacement the lower the gas mileage), the linetype specified
that it should be dashed (not continuous), the size made the dashes big, and the se flag told
ggplot to turn off the gray shadows indicating standard errors (confidence intervals).
Finally, let’s do a simple plot using the black and white theme, theme_bw. Specify g and add a
call to the function geom_point with the argument setting the color to the drv type. Then add a
call to the function theme_bw with the argument base_family set equal to “Times”. See if you
notice the difference.
g + geom_point(aes(color = drv)) + theme_bw(base_family="Times")
One final note before we go through a more complicated, layered ggplot example, and this
concerns the limits of the axes. We’re pointing this out to emphasize a subtle difference
between ggplot and the base plotting function plot.
We’ll close with a more complicated example to show you the full power of ggplot and the entire
ggplot2 package. We’ll continue to work with the mpg dataset.
…
|======================================================================== | 81%
Start by creating the graphical object g by assigning to it a call to ggplot with 2 arguments.
The first is the dataset and the second is a call to the function aes. This call will have 3
arguments, x set equal to displ, y set equal to hwy, and color set equal to factor(year). This
last will allow us to distinguish between the two manufacturing years (1999 and 2008) in our
data.
g <- ggplot(mpg,aes(x=displ,y=hwy,color=factor(year)))
We’ll build the plot up step by step. First add to g a call to the function geom_point with 0
arguments.
g+geom_point()
A simple, yet comfortingly familiar scatterplot appears. Let’s make our display a 2 dimensional
multi-panel plot. Recall your last command (with the up arrow) and add to it a call to the
function facet_grid. Give it 2 arguments. The first is the formula drv~cyl, and the second is
the argument margins set equal to TRUE. Try this now.
g+geom_point()+facet_grid(drv~cyl,margins=TRUE)
A 4 by 5 plot, huh? The margins argument tells ggplot to display the marginal totals over each
row and column, so instead of seeing 3 rows (the number of drv factors) and 4 columns (the
number of cyl factors) we see a 4 by 5 display. Note that the panel in position (4,5) is a tiny
version of the scatterplot of the entire dataset.
…
|=================================================================================== | 94%
Now add to your last command (or retype it if you like to type) a call to geom_smooth with 4
arguments. These are method set to “lm”, se set to FALSE, size set to 2, and color set to
“black”.
g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method = "lm",se = FALSE,size=2,color="black")
Angry Birds? Finally, add to your last command (or retype it if you like to type) a call to the
function labs with 3 arguments. These are x set to “Displacement”, y set to “Highway Mileage”,
and title set to “Swirl Rules!”.
g+geom_point()+facet_grid(drv~cyl,margins=TRUE)+geom_smooth(method = "lm",se = FALSE,size=2,color="black")+labs(x="Displacement",y="Highway Mileage",title="Swirl Rules!")
In this lesson we’ll go through a few more qplot examples using diamond data which comes with
the ggplot2 package. This data is a little more complicated than the mpg data and it contains
information on various characteristics of diamonds.
Now let’s plot a histogram of the price of the 53940 diamonds in this dataset. Recall that a
histogram requires only one variable of the data, so run the R command qplot with the first
argument price and the argument data set equal to diamonds. This will show the frequency of
different diamond prices.
qplot(price,data=diamonds)
stat_bin() using bins = 30. Pick better value with binwidth.
You are amazing!
|========== | 11%
Not only do you get a histogram, but you also get a message about the binwidth defaulting to
range/30. Recall that range refers to the spread or dispersion of the data, in this case price
of diamonds. Run the R command range now with diamonds$price as its argument.
range(diamonds$price)
[1]   326 18823
Keep up the great work!
|============ | 13%
We see that range returned the minimum and maximum prices, so the diamonds vary in price from
$326 to $18823. We’ve done the arithmetic for you, the range (difference between these two
numbers) is $18497.
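The same arithmetic can be reproduced at the prompt; a quick check using the two numbers range just returned:

```r
# min and max prices reported by range(diamonds$price)
lo <- 326
hi <- 18823

hi - lo         # 18497, the range (spread) of the prices
(hi - lo) / 30  # 616.5667, the default binwidth ggplot complained about
```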
…
|============= | 15%
Rerun qplot now with 3 arguments. The first is price, the second is data set equal to diamonds,
and the third is binwidth set equal to 18497/30. (Use the up arrow to save yourself some
typing.) See if the plot looks familiar.
qplot(price,data=diamonds,binwidth=18497/30)
That’s the answer I was looking for.
|=============== | 17%
No more messages in red, but a histogram almost identical to the previous one! If you typed
18497/30 at the command line you would get the result 616.5667. This means that the height of
each bin tells you how many diamonds have a price between x and x+617 where x is the left edge
of the bin.
We’ve created a vector containing integers that are multiples of 617 for you. It’s called brk.
Look at it now.
brk
 [1]     0   617  1234  1851  2468  3085  3702  4319  4936  5553  6170  6787  7404  8021  8638
[16]  9255  9872 10489 11106 11723 12340 12957 13574 14191 14808 15425 16042 16659 17276 17893
[31] 18510 19127
Perseverance, that’s the answer.
|================== | 20%
We’ve also created a vector containing the number of diamonds with prices between each pair of
adjacent entries of brk. For instance, the first count is the number of diamonds with prices
between 0 and $617, and the second is the number of diamonds with prices between $617 and $1234.
Look at the vector named counts now.
counts
 [1]  4611 13255  5230  4262  3362  2567  2831  2841  2203  1666  1445  1112   987   766   796
[16]   655   606   553   540   427   429   376   348   338   298   305   269   287   227   251
[31]    97
You are really on a roll!
|==================== | 22%
See how it matches the histogram you just plotted? So, qplot really works!
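You can reproduce this kind of count yourself with cut() and table(). A minimal sketch on made-up prices (not the diamonds data), using the first few brk boundaries:

```r
# Toy prices and the first few bin edges from brk
prices <- c(100, 500, 700, 1300, 200)
edges  <- c(0, 617, 1234, 1851)

# cut() labels each price with its bin; table() counts per bin
table(cut(prices, breaks = edges))
```

Applying the same idea to diamonds$price with the full brk vector should recover the counts vector above.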
…
|===================== | 24%
You’re probably sick of it but rerun qplot again, this time with 4 arguments. The first 3 are
the same as the last qplot command you just ran (price, data set equal to diamonds, and binwidth
set equal to 18497/30). (Use the up arrow to save yourself some typing.) The fourth argument is
fill set equal to cut. The shape of the histogram will be familiar, but it will be more
colorful.
qplot(price,data=diamonds,binwidth=18497/30,fill=cut)
This shows how the counts within each price grouping (bin) are distributed among the different
cuts of diamonds. Notice how qplot displays these distributions relative to the cut legend on
the right. The fair cut diamonds are at the bottom of each bin, the good cuts are above them,
then the very good above them, until the ideal cuts are at the top of each bin. You can quickly
see from this display that there are very few fair cut diamonds priced above $5000.
Now we’ll replot the histogram as a density function which will show the proportion of diamonds
in each bin. This means that the shape will be similar but the scale on the y-axis will be
different since, by definition, the density function is nonnegative everywhere, and the area
under the curve is one. To do this, simply call qplot with 3 arguments. The first 2 are price
and data (set equal to diamonds). The third is geom which should be set equal to the string
“density”. Try this now.
qplot(price,data=diamonds,geom="density")
Notice that the shape is similar to that of the histogram we saw previously. The highest peak is
close to 0 on the x-axis meaning that most of the diamonds in the dataset were inexpensive. In
general, as prices increase (move right along the x-axis) the number of diamonds (at those
prices) decrease. The exception to this is when the price is around $4000; there’s a slight
increase in frequency. Let’s see if cut is responsible for this increase.
…
|============================ | 31%
Rerun qplot, this time with 4 arguments. The first 2 are the usual, and the third is geom set
equal to “density”. The fourth is color set equal to cut. Try this now.
qplot(price,data=diamonds,geom="density",color=cut)
See how easily qplot did this? Four of the five cuts have 2 peaks, one at price $1000 and the
other between $4000 and $5000. The exception is the Fair cut which has a single peak at $2500.
This gives us a little more understanding of the histogram we saw before.
Let’s move on to scatterplots. For these we’ll need to specify two variables from the diamond
dataset.
…
|================================= | 37%
Let’s start with carat and price. Use these as the first 2 arguments of qplot. The third should
be data set equal to the dataset. Try this now.
qplot(carat,price,data=diamonds)
We see the positive trend here, as the number of carats increases the price also goes up.
Now rerun the same command, except add a fourth parameter, shape, set equal to cut.
qplot(carat,price,data=diamonds,shape=cut)
That’s correct!
|====================================== | 43%
The same scatterplot appears, except the cuts of the diamonds are distinguished by different
symbols. The legend at the right tells you which symbol is associated with each cut. These are
small and hard to read, so rerun the same command, except this time instead of setting the
argument shape equal to cut, set the argument color equal to cut.
qplot(carat,price,data=diamonds,color=cut)
We’ll rerun the plot you just did (carat,price,data=diamonds and color=cut) but add an
additional parameter. Use geom_smooth with the method set equal to the string “lm”.
qplot(carat,price,data=diamonds,color=cut)+geom_smooth(method="lm")
Again, we see the same scatterplot, but slightly more compressed and showing 5 regression lines, one for each cut of diamonds. It might be hard to see, but around each line is a shadow showing the 95% confidence interval. We see, unsurprisingly, that the better the cut, the steeper (more positive) the slope of the lines.
Finally, let’s rerun that plot you just did (qplot(carat,price,data=diamonds,color=cut) +
geom_smooth(method="lm")) but add one (just one) more argument to qplot. The new argument is
facets and it should be set equal to the formula .~cut. Recall that the facets argument
indicates we want a multi-panel plot. The symbol to the left of the tilde indicates rows (in
this case just one) and the symbol to the right of the tilde indicates columns (in this case
five, the number of cuts). Try this now.
qplot(carat,price,data=diamonds,color=cut,facets = .~cut)+geom_smooth(method="lm")
First create a graphical object g by assigning to it the output of a call to the function ggplot
with 2 arguments. The first is the dataset diamonds and the second is a call to the function aes
with 2 arguments, depth and price. Remember you won’t see any result.
g <- ggplot(diamonds,aes(x=depth,y=price))
Type g+geom_point(alpha=1/3) at the command prompt.
g+geom_point(alpha=1/3)
That’s somewhat interesting. We see that depth ranges from 43 to 79, but the densest
distribution is around 60 to 65. Suppose we want to see if this relationship (between depth and
price) is affected by cut or carat. We know cut is a factor with 5 levels (Fair, Good, Very
Good, Premium, and Ideal). But carat is numeric and not a discrete factor. Can we do this?
Of course! That’s why we asked. R has a handy command, cut, which allows you to divide your data
into sets and label each entry as belonging to one of the sets, in effect creating a new factor.
First, we’ll have to decide where to cut the data.
Let’s divide the data into 3 pockets, so 1/3 of the data falls into each. We’ll use the R
command quantile to do this. Create the variable cutpoints and assign to it the output of a call
to the function quantile with 3 arguments. The first is the data to cut, namely diamonds$carat;
the second is a call to the R function seq. This is also called with 3 arguments, (0, 1, and
length set equal to 4). The third argument to the call to quantile is the boolean na.rm set
equal to TRUE.
Type cutpoints <- quantile(diamonds$carat,seq(0,1,length=4),na.rm=TRUE) at the command prompt.
cutpoints <- quantile(diamonds$carat,seq(0,1,length=4),na.rm=TRUE)
Look at cutpoints now to understand what it is.
cutpoints
       0% 33.33333% 66.66667%      100%
     0.20      0.50      1.00      5.01
That’s correct!
|======================================================================= | 80%
We see a 4-long vector (explaining why length was set equal to 4). We also see that .2 is the
smallest carat size in the dataset and 5.01 is the largest. One third of the diamonds are
between .2 and .5 carats and another third are between .5 and 1 carat in size. The remaining
third are between 1 and 5.01 carats. Now we can use the R command cut to label each of the 53940
diamonds in the dataset as belonging to one of these 3 factors. Create a new name in diamonds,
diamonds$car2 by assigning it the output of the call to cut. This command takes 2 arguments,
diamonds$carat, which is what we want to cut, and cutpoints, the places where we’ll cut.
diamonds$car2 <- cut(diamonds$carat,cutpoints)
Perseverance, that’s the answer.
|========================================================================= | 81%
Now we can continue with our multi-facet plot. First we have to reset g since we changed the
dataset (diamonds) it contained (by adding a new column). Assign to g the output of a call to
ggplot with 2 arguments. The dataset diamonds is the first, and a call to the function aes with
2 arguments (depth,price) is the second.
g <- ggplot(diamonds,aes(x=depth,y=price))
Excellent job!
|========================================================================== | 83%
Now add to g calls to 2 functions. This first is a call to geom_point with the argument alpha
set equal to 1/3. The second is a call to the function facet_grid using the formula cut ~ car2
as its argument.
g+geom_point(alpha=1/3)+facet_grid(cut ~ car2)
We see a multi-facet plot with 5 rows, each corresponding to a cut factor. Not surprising. What
is surprising is the number of columns. We were expecting 3 and got 4. Why?
…
|============================================================================= | 87%
The first 3 columns are labeled with the cutpoint boundaries. The fourth is labeled NA and shows
us where the data points with missing data (NA or Not Available) occurred. We see that there
were only a handful (12 in fact) and they occurred in Very Good, Premium, and Ideal cuts. We
created a vector, myd, containing the indices of these datapoints. Look at these entries in
diamonds by typing the expression diamonds[myd,]. The myd tells R what rows to show and the
empty column entry says to print all the columns.
diamonds[myd,]
# A tibble: 12 x 12
   carat cut       color clarity depth table price     x     y     z carat2 car2
 1   0.2 Premium   E     SI2      60.2    62   345  3.79  3.75  2.27 NA     NA
 2   0.2 Premium   E     VS2      59.8    62   367  3.79  3.77  2.26 NA     NA
 3   0.2 Premium   E     VS2      59.0    60   367  3.81  3.78  2.24 NA     NA
 4   0.2 Premium   E     VS2      61.1    59   367  3.81  3.78  2.32 NA     NA
 5   0.2 Premium   E     VS2      59.7    62   367  3.84  3.80  2.28 NA     NA
 6   0.2 Ideal     E     VS2      59.7    55   367  3.86  3.84  2.30 NA     NA
 7   0.2 Premium   F     VS2      62.6    59   367  3.73  3.71  2.33 NA     NA
 8   0.2 Ideal     D     VS2      61.5    57   367  3.81  3.77  2.33 NA     NA
 9   0.2 Very Good E     VS2      63.4    59   367  3.74  3.71  2.36 NA     NA
10   0.2 Ideal     E     VS2      62.2    57   367  3.76  3.73  2.33 NA     NA
11   0.2 Premium   D     VS2      62.3    60   367  3.73  3.68  2.31 NA     NA
12   0.2 Premium   D     VS2      61.7    60   367  3.77  3.72  2.31 NA     NA
All that practice is paying off!
|=============================================================================== | 89%
We see these entries match the plots. Whew - that’s a relief. The car2 field is, in fact, NA for
these entries, but the carat field shows they each had a carat size of .2. What’s going on here?
…
|================================================================================= | 91%
Actually our plot answers this question. The boundaries for each column appear in the gray
labels at the top of each column, and we see that the first column is labeled (0.2,0.5]. This
indicates that this column contains data greater than .2 and less than or equal to .5. So
diamonds with carat size .2 were excluded from the car2 field.
Finally, recall the last plotting command (g+geom_point(alpha=1/3)+facet_grid(cut~car2)) or
retype it if you like and add another call. This one to the function geom_smooth. Pass it 3
arguments, method set equal to the string “lm”, size set equal to 3, and color equal to the
string “pink”.
g+geom_point(alpha=1/3)+facet_grid(cut ~ car2)+geom_smooth(method="lm",size=3,color="pink")
Nice thick regression lines which are somewhat interesting. You can add labels to the plot if
you want but we’ll let you experiment on your own.
…
|====================================================================================== | 96%
Lastly, ggplot2 can, of course, produce boxplots. This final exercise is the sum of 3 function
calls. The first call is to ggplot with 2 arguments, diamonds and a call to aes with carat and
price as arguments. The second call is to geom_boxplot with no arguments. The third is to
facet_grid with one argument, the formula . ~ cut. Try this now.
ggplot(diamonds,aes(x=carat,y=price))+geom_boxplot()+facet_grid(. ~ cut)
Yes! A boxplot looking like marshmallows about to be roasted. Well done and congratulations!
You’ve finished this jewel of a lesson. Hope it paid off!
EXERCISES
To give you an idea of what we’re talking about, consider these random points we generated.
We’ll use them to demonstrate hierarchical clustering in this lesson. We’ll do this in several
steps, but first we have to clarify our terms and concepts.
It’s pretty obvious that out of the 4 choices, the pair 5 and 6 were the closest together.
However, there are several ways to measure distance or similarity. Euclidean distance,
Manhattan distance, and correlation similarity are all continuous measures (binary measures
exist as well, for data whose entries are 0s and 1s). In this lesson we’ll just briefly discuss
Euclidean and Manhattan distance. It’s important that you use a measure of distance that fits
your problem.
Euclidean distance is what you learned about in high school algebra. Given two points on a
plane, (x1,y1) and (x2,y2), the Euclidean distance is the square root of the sums of the squares
of the distances between the two x-coordinates (x1-x2) and the two y-coordinates (y1-y2). You
probably recognize this as an application of the Pythagorean theorem which yields the length of
the hypotenuse of a right triangle.
…
|============== | 16%
It shouldn’t be hard to believe that this generalizes to more than two dimensions as shown in
the formula at the bottom of the picture shown here.
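A quick sanity check of the formula, using two made-up points; dist() (used later in this lesson) computes the same quantity:

```r
# Two toy points on the plane
p1 <- c(1, 2)
p2 <- c(4, 6)

# Euclidean distance by the Pythagorean formula...
sqrt(sum((p1 - p2)^2))   # 5

# ...and via R's dist(), which defaults to the Euclidean metric
dist(rbind(p1, p2))      # also 5
```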
In this case, we can use Manhattan or city block distance (also known as a taxicab metric).
This picture, copied from http://en.wikipedia.org/wiki/Taxicab_geometry, shows what this means.
…
|=================== | 21%
You want to travel from the point at the lower left to the one on the top right. The shortest
distance is the Euclidean (the green line), but you’re limited to the grid, so you have to
follow a path similar to those shown in red, blue, or yellow. These all have the same length
(12) which is the number of small gray segments covered by their paths.
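dist() supports this metric too; a sketch with a toy pair of grid points:

```r
# From (0,0) to (3,4): walk 3 blocks over and 4 blocks up
p1 <- c(0, 0)
p2 <- c(3, 4)

dist(rbind(p1, p2), method = "manhattan")  # 7 = |3-0| + |4-0|
dist(rbind(p1, p2))                        # 5, the straight-line (Euclidean) shortcut
```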
Run the R command dist with the argument dataFrame to compute the distances between all pairs of
these points. By default dist uses Euclidean distance as its metric, but other metrics such as
Manhattan, are available. Just use the default.
dist(dataFrame)
So 0.0815 (units are unspecified) between points 5 and 6 is the shortest distance. We can put these points in a
single cluster and look for another close pair of points.
So 10 and 11 are another pair of points that would be in a second cluster. We’ll start creating our dendrogram now.
Here are the original plot and two beginning pieces of the dendrogram.
The R function that builds the dendrogram is called hclust() and takes as an argument the
pairwise distance matrix which we looked at before. We’ve stored this matrix for you in a
variable called distxy. Run hclust now with distxy as its argument and put the result in the
variable hc.
hc <- hclust(distxy)
Call the R function plot with one argument, hc.
plot(hc)
Nice plot, right? R’s plot conveniently labeled everything for you. The points we saw are the leaves at the bottom
of the graph, 5 and 6 are connected, as are 10 and 11. Moreover, we see that the original 3 groupings of points are
closest together as leaves on the picture. That’s reassuring. Now call plot again, this time with the argument
as.dendrogram(hc).
class(hc)
[1] "hclust"
plot(as.dendrogram(hc))
The essentials are the same, but the labels are missing and the leaves (original points) are all printed at the same
level. Notice that the vertical heights of the lines and labeling of the scale on the left edge give some indication
of distance. Use the R command abline to draw a horizontal blue line at 1.5 on this plot. Recall that this requires
2 arguments, h=1.5 and col="blue".
abline(h=1.5,col="blue")
We see that this blue line intersects 3 vertical lines and this tells us that using the distance 1.5 (unspecified
units) gives us 3 clusters (1 through 4), (9 through 12), and (5 through 8). We call this a “cut” of our dendrogram.
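The same cut can be made numerically with cutree(), which returns each point's cluster label instead of a picture. A sketch on made-up points (the lesson's x and y aren't reproduced here):

```r
# Twelve toy points in three loose groups
set.seed(1)
xy <- cbind(rnorm(12, mean = rep(c(1, 2, 3), each = 4), sd = 0.1),
            rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.1))

hc <- hclust(dist(xy))
cutree(hc, h = 1.0)  # cut at a given height, as the blue line does
cutree(hc, k = 3)    # or ask directly for 3 clusters
```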
Now cut the dendrogram by drawing a red horizontal line at .4.
abline(h=0.4,col="red")
So the number of clusters in your data depends on where you draw the line! (We said there’s a lot of flexibility
here.) Now that we’ve seen the practice, let’s go back to some “theory”. Notice that the two original groupings, 5
through 8, and 9 through 12, are connected with a horizontal line near the top of the display. You’re probably
wondering how distances between clusters of points are measured.
…
|=============================================== | 53%
There are several ways to do this. We’ll just mention two. The first is called complete linkage and it says that if
you’re trying to measure a distance between two clusters, take the greatest distance between the pairs of points in
those two clusters. Obviously such pairs contain one point from each cluster.
COMPLETE LINKAGE …
|================================================= | 55%
So if we were measuring the distance between the two clusters of points (1 through 4) and (5 through 8), using
complete linkage as the metric we would use the distance between points 4 and 8 as the measure since this is the
largest distance between the pairs of those groups.
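hclust() uses complete linkage by default, and you can verify the "greatest pairwise distance" rule on a toy example:

```r
# Two tiny clusters on a line: {0, 1} and {4, 6}
pts <- c(0, 1, 4, 6)

# Complete linkage: the cluster-to-cluster distance is the largest pairwise gap
max(abs(outer(c(0, 1), c(4, 6), "-")))  # 6 (between points 0 and 6)

hc <- hclust(dist(pts))                 # method = "complete" is the default
tail(hc$height, 1)                      # the final merge happens at height 6
```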
For 3 clusters it would look like this.
We’ve created the dataframe dFsm for you containing these 3 points, 4, 8, and 11. Run dist on dFsm to see what the
smallest distance between these 3 points is.
dist(dFsm)
Nice work!
|======================================================= | 61%
We see that the smallest distance is between points 2 and 3 in this reduced set, (these are actually points 8 and 11
in the original set), indicating that the two clusters these points represent ((5 through 8) and (9 through 12)
respectively) would be joined (at a distance of 1.869) before being connected with the third cluster (1 through 4).
This is consistent with the dendrogram we plotted.
AVERAGE LINKAGE
The second way to measure a distance between two clusters that we’ll just mention is called average linkage. First
you compute an “average” point in each cluster (think of it as the cluster’s center of gravity). You do this by
computing the mean (average) x and y coordinates of the points in the cluster.
…
|========================================================= | 65%
Then you compute the distances between each cluster average to compute the intercluster distance.
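One caution when trying this in R: hclust(method = "average") averages the pairwise distances between clusters (UPGMA), which is close to, but not identical to, the center-of-gravity description above (that idea corresponds to method = "centroid"). A sketch on two tiny clusters, {0, 1} and {4, 6}:

```r
# Average linkage: mean of the four pairwise gaps between the clusters
mean(abs(outer(c(0, 1), c(4, 6), "-")))   # (4 + 6 + 3 + 5) / 4 = 4.5

hc <- hclust(dist(c(0, 1, 4, 6)), method = "average")
tail(hc$height, 1)                        # the clusters join at height 4.5
```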
…
|=========================================================== | 66%
Now look at the hierarchical cluster we created before, hc.
hc
Call:
hclust(d = distxy)

Cluster method   : complete
Distance         : euclidean
Number of objects: 12
In our simple set of data, the average and complete linkages aren’t that different, but in more complicated datasets
the type of linkage you use could affect how your data clusters. It is a good idea to experiment with different
methods of linkage to see the varying ways your data groups. This will help you determine the best way to continue
with your analysis.
HEATMAPS
R provides a handy function to produce heat maps. It’s called heatmap. We’ve put the point data we’ve been using
throughout this lesson in a matrix. Call heatmap now with 2 arguments. The first is dataMatrix and the second is col
set equal to cm.colors(25). This last is optional, but we like the colors better than the default ones.
heatmap(dataMatrix,col=cm.colors(25))
We see an interesting display of sorts. This is a very simple heat map - simple because the data isn’t very complex.
The rows and columns are grouped together as shown by colors. The top rows (labeled 5, 6, and 7) seem to be in the
same group (same colors) while 8 is next to them but colored differently. This matches the dendrogram shown on the
left edge. Similarly, 9, 12, 11, and 10 are grouped together (row-wise) along with 3 and 2. These are followed by 1
and 4 which are in a separate group. Column data is treated independently of rows but is also grouped.
This looks slightly more interesting than the heatmap for the point data. It shows a little better how the rows and
columns are treated (clustered and colored) independently of one another. To understand the disparity in color
(between the left 4 columns and the right 2) look at mt now.
See how four of the columns are all relatively small numbers and only two (disp and hp) are large? That explains the
big difference in color columns. Now to understand the grouping of the rows call plot with one argument, the
dendrogram object denmt we’ve created for you.
plot(denmt)
We see that this dendrogram is the one displayed at the side of the heat map. How was this created? Recall that we
generalized the distance formula for more than 2 dimensions. We’ve created a distance matrix for you, distmt. Look
at it now.
distmt
See how these distances match those in the dendrogram? So hclust really works! Let’s review now.
KMEANS
In this lesson we’ll learn about k-means clustering, another simple way of examining and
organizing multi-dimensional data. As with hierarchical clustering, this technique is most
useful in the early stages of analysis when you’re trying to get an understanding of the data,
e.g., finding some pattern or relationship between different factors or variables.
Since clustering organizes data points that are close into groups we’ll assume we’ve decided on
a measure of distance, e.g., Euclidean.
As we said, k-means is a partitioning approach which requires that you first guess how many
clusters you have (or want). Once you fix this number, you randomly create a “centroid” (a
phantom point) for each cluster and assign each point or observation in your dataset to the
centroid to which it is closest. Once each point is assigned a centroid, you readjust the
centroid’s position by making it the average of the points assigned to it.
When it’s finished k-means clustering returns a final position of each cluster’s centroid as
well as the assignment of each data point or observation to a cluster.
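Base R packages this whole loop in the function kmeans(); a minimal sketch on made-up points (not the lesson's x and y):

```r
# Twelve toy points in three loose groups
set.seed(42)
x <- rnorm(12, mean = rep(c(1, 2, 3), each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

km <- kmeans(cbind(x, y), centers = 3)
km$centers   # final centroid positions, one row per cluster
km$cluster   # which cluster each point was assigned to
```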
Now we’ll step through this process using our random points as our data. The coordinates of
these are stored in 2 vectors, x and y. We eyeball the display and guess that there are 3
clusters. We’ll pick 3 positions of centroids, one for each cluster.
We’ve created two 3-long vectors for you, cx and cy. These respectively hold the x- and y-
coordinates for 3 proposed centroids. For convenience, we’ve also stored them in a 2 by 3 matrix
cmat. The x coordinates are in the first row and the y coordinates in the second. Look at cmat
now.
cmat
     [,1] [,2] [,3]
[1,]    1  1.8  2.5
[2,]    2  1.0  1.5
Excellent work!
|===================== | 24%
The coordinates of these points are (1,2), (1.8,1) and (2.5,1.5). We’ll add these centroids to
the plot of our points. Do this by calling the R command points with 6 arguments. The first 2
are cx and cy, and the third is col set equal to the concatenation of 3 colors, “red”, “orange”,
and “purple”. The fourth argument is pch set equal to 3 (a plus sign), the fifth is cex set
equal to 2 (expansion of character), and the final is lwd (line width) also set equal to 2.
points(cx,cy,col=c("red","orange","purple"),pch=3,cex=2,lwd=2)
Now we have to calculate distances between each point and every centroid. There are 12 data
points and 3 centroids. How many distances do we have to calculate?
1: 15
2: 36
3: 108
4: 9
Selection: 2
You got it!
|=========================== | 30%
We’ve written a function for you called mdist which takes 4 arguments. The vectors of data
points (x and y) are the first two and the two vectors of centroid coordinates (cx and cy) are
the last two. Call mdist now with these arguments.
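The transcript doesn’t show mdist’s source. A sketch that would produce the same shape of result (a 3-row by 12-column matrix of Euclidean distances, one row per centroid) might look like this; the name mdist and the argument order follow the lesson, but the body is our guess.

```r
# Hypothetical reimplementation of the lesson's mdist helper.
# outer(cx, x, "-") is a length(cx)-by-length(x) matrix of coordinate
# differences, so the result has one row per centroid, one column per point.
mdist <- function(x, y, cx, cy) {
  sqrt(outer(cx, x, "-")^2 + outer(cy, y, "-")^2)
}
```

Calling mdist(x, y, cx, cy) then gives the matrix that apply(…, 2, which.min) reduces to cluster assignments below.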
We’ve stored these distances in the matrix distTmp for you. Now we have to assign a cluster to each point. To do that we’ll look at each column and pick the minimum entry.
R has a handy function which.min which you can apply to ALL the columns of distTmp with one
call. Simply call the R function apply with 3 arguments. The first is distTmp, the second is 2
meaning the columns of distTmp, and the third is which.min, the function you want to apply to
the columns of distTmp. Try this now.
apply(distTmp,2,which.min)
 [1] 2 2 2 1 3 3 3 1 3 3 3 3
We’ve stored the vector of cluster colors ("red","orange","purple") in the array cols1 for you
and we’ve also stored the cluster assignments in the array newClust. Let’s color the 12 data
points according to their assignments. Again, use the command points with 5 arguments. The first
2 are x and y. The third is pch set to 19, the fourth is cex set to 2, and the last, col is set
to cols1[newClust].
points(x,y,pch=19,cex=2,col=cols1[newClust])
We can use the R function tapply which applies “a function over a ragged array”. This means that
every element of the array is assigned a factor and the function is applied to subsets of the
array (identified by the factor vector). This allows us to take advantage of the factor vector
newClust we calculated. Call tapply now with 3 arguments, x (the data), newClust (the factor
array), and mean (the function to apply).
tapply(x,newClust,mean)
       1        2        3
1.210767 1.010320 2.498011
Keep up the great work!
|========================================= | 46%
Repeat the call, except now apply it to the vector y instead of x.
tapply(y,newClust,mean)
       1        2        3
1.730555 1.016513 1.354373
Great job!
|=========================================== | 48%
Now that we have new x and new y coordinates for the 3 centroids we can plot them. We’ve stored
off the coordinates for you in variables newCx and newCy. Use the R command points with these as
the first 2 arguments. In addition, use the arguments col set equal to cols1, pch equal to 8,
cex equal to 2 and lwd also equal to 2.
points(newCx,newCy,col=cols1,pch=8,cex=2,lwd=2)
We see how the centroids have moved closer to their respective clusters. This is especially true
of the second (orange) cluster. Now call the distance function mdist with the 4 arguments x, y,
newCx, and newCy. This will allow us to reassign the data points to new clusters if necessary.
We’ve stored off this new matrix of distances in the matrix distTmp2 for you. Recall that the
first cluster is red, the second orange and the third purple. Look closely at columns 4 and 7 of
distTmp2. What will happen to points 4 and 7?
1: They will both change clusters
2: Nothing
3: They will both change to cluster 2
4: They’re the only points that won’t change clusters
Selection: 1
All that hard work is paying off!
|================================================ | 54%
Now call apply with 3 arguments, distTmp2, 2, and which.min to find the new cluster assignments
for the points.
apply(distTmp2,2,which.min)
 [1] 2 2 2 2 3 3 1 1 3 3 3 3
Keep up the great work!
|================================================== | 56%
We’ve stored off the new cluster assignments in a vector of factors called newClust2. Use the R
function points to recolor the points with their new assignments. Again, there are 5 arguments,
x and y are first, followed by pch set to 19, cex to 2, and col to cols1[newClust2].
points(x,y,pch=19,cex=2,col=cols1[newClust2])
You’re the best!
|==================================================== | 58%
Notice that points 4 and 7 both changed clusters: point 4 moved from cluster 1 to 2 (red to orange), and point 7 switched from cluster 3 to 1 (purple to red).
Now use tapply to find the x coordinate of the new centroid. Recall there are 3 arguments, x,
newClust2, and mean.
tapply(x,newClust2,mean)
1 2 3
1.8878628 0.8904553 2.6001704
Do the same to find the new y coordinate.
tapply(y,newClust2,mean)
1 2 3
2.157866 1.006871 1.274675
Perseverance, that’s the answer.
|========================================================= | 64%
We’ve stored off these coordinates for you in the variables finalCx and finalCy. Plot these new
centroids using the points function with 6 arguments. The first 2 are finalCx and finalCy. The
argument col should equal cols1, pch should equal 9, cex 2 and lwd 2.
points(finalCx,finalCy,col=cols1,pch=9,cex=2,lwd=2)
You got it!
|=========================================================== | 66%
It should be obvious that if we continued this process points 5 through 8 would all turn red,
while points 1 through 4 stay orange, and points 9 through 12 purple.
Now that you’ve gone through an example step by step, you’ll be relieved to hear that R provides
a command to do all this work for you. Unsurprisingly it’s called kmeans and, although it has
several parameters, we’ll just mention four. These are x, (the numeric matrix of data), centers,
iter.max, and nstart. The second of these (centers) can be either a number of clusters or a set
of initial centroids. The third, iter.max, specifies the maximum number of iterations to go
through, and nstart is the number of random starts you want to try if you specify centers as a
number.
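Putting those parameters together, a direct call might look like the sketch below. The 12 simulated points here merely stand in for the lesson’s x and y vectors, whose exact values we don’t have.

```r
set.seed(1234)
# simulate 12 points falling roughly into 3 groups
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)
# 3 clusters, at most 10 iterations, 20 random starts
kmObj <- kmeans(dataFrame, centers = 3, iter.max = 10, nstart = 20)
kmObj$size     # how many points landed in each cluster
kmObj$centers  # final centroid coordinates
```

With nstart = 20, R runs 20 random initializations and keeps the best result, which guards against an unlucky starting centroid.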
The program returns the information that the data clustered into 3 clusters each of size 4. It
also returns the coordinates of the 3 cluster means, a vector named cluster indicating how the
12 points were partitioned into the clusters, and the sum of squares within each cluster. It
also shows all the available components returned by the function. We’ve stored off this data for
you in a kmeans object called kmObj. Look at kmObj$iter to see how many iterations the algorithm
went through.
kmObj$iter
[1] 2
All that hard work is paying off!
|================================================================== | 74%
Two iterations as we did before. We just want to emphasize how you can access the information
available to you. Let’s plot the data points color coded according to their cluster. This was
stored in kmObj$cluster. Run plot with 5 arguments. The data, x and y, are the first two; the
third, col is set equal to kmObj$cluster, and the last two are pch and cex. The first of these
should be set to 19 and the last to 2.
plot(x,y,col=kmObj$cluster,pch=19,cex=2)
Now add the centroids which are stored in kmObj$centers. Use the points function with 5 arguments. The first two are kmObj$centers and col=c("black","red","green"). The last three, pch, cex, and lwd, should all equal 3.
Now for some fun! We want to show you how the output of the kmeans function is affected by its random start (when you just ask for a number of clusters). With random starts you might want to run the function several times to get an idea of the relationships between your observations. We’ll call kmeans with the same data points (stored in dataFrame), but ask for 6 clusters instead of 3.
We’ll plot our data points several times and each time we’ll just change the argument col which
will show us how the R function kmeans is clustering them. So, call plot now with 5 arguments.
The first 2 are x and y. The third is col set equal to the call kmeans(dataFrame,6)$cluster. The
last two (pch and cex) are set to 19 and 2 respectively.
plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)
See how the points cluster? Now recall your last command and rerun it.
plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)
plot(x,y,col=kmeans(dataFrame,6)$cluster,pch=19,cex=2)
So the clustering changes with different starts. Perhaps 6 is too many clusters? Let’s review!
DIMENSION REDUCTION PCA/SVD
In this lesson we’ll discuss principal component analysis (PCA) and singular value decomposition (SVD), two important and related techniques of dimension reduction. Dimension reduction means finding a smaller subset of variables in a dataset that captures its essence. PCA and SVD are used in both the exploratory phase and the more formal modelling stage of analysis. We’ll focus on the exploratory phase and briefly touch on some of the underlying theory.
We’ll begin with a motivating example - random data.
This is dataMatrix, a matrix of 400 random normal numbers (mean 0 and standard deviation 1). We’re displaying it
with the R command image. Run the R command head with dataMatrix as its argument to see what dataMatrix looks like.
heatmap(dataMatrix)
We can see that even with the clustering that heatmap provides, permuting the rows (observations) and columns
(variables) independently, the data still looks random.
Let’s add a pattern to the data. We’ve put some R code in the file addPatt.R for you. Run the command myedit with the
single argument “addPatt.R” (make sure to use the quotation marks) to see the code. You might have to click your
cursor in the console after you do this to keep from accidentally changing the file.
set.seed(678910)
for (i in 1:40) {
  # flip a coin
  coinFlip <- rbinom(1, size = 1, prob = 0.5)
  # if coin is heads, add a common pattern to that row:
  # add 0 to the first 5 columns and 3 to the last 5
  if (coinFlip) {
    dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 3), each = 5)
  }
}
So in rows affected by the coin flip, the 5 left columns will still have a mean of 0 but the right 5 columns will
have a mean closer to 3.
Here’s the image of the altered dataMatrix after the pattern has been added. The pattern is clearly visible in the
columns of the matrix. The right half is yellower or hotter, indicating higher values in the matrix.
Now run the R command heatmap again with dataMatrix as its only argument. This will perform a hierarchical cluster
analysis on the matrix.
heatmap(dataMatrix)
Again we see the pattern in the columns of the matrix. As shown in the dendrogram at the top of the display, these
split into 2 clusters, the lower numbered columns (1 through 5) and the higher numbered ones (6 through 10). Recall
from the code in addPatt.R that for rows selected by the coinflip the last 5 columns had 3 added to them. The rows
still look random.
Now consider this picture. On the left is an image similar to the heatmap of dataMatrix you just plotted. It is an
image plot of the output of hclust(), a hierarchical clustering function applied to dataMatrix. Yellow indicates
“hotter” or higher values than red. This is consistent with the pattern we applied to the data (increasing the values
for some of the rightmost columns).
The middle display shows the mean of each of the 40 rows (along the x-axis). The rows are shown in the same order as
the rows of the heat map on the left. The rightmost display shows the mean of each of the 10 columns. Here the
column numbers are along the x-axis and their means along the y.
We see immediately the connection between the yellow (hotter) portion of the cluster image and the higher row means, both in the upper right portion of the displays. Similarly, the higher valued column means are in the right half of that display and lower column means are in the left half.
Now we’ll talk a little theory. Suppose you have 1000’s of multivariate variables X_1, … ,X_n. By multivariate we
mean that each X_i contains many components, i.e., X_i = (X_{i1}, … , X_{im}). However, these variables
(observations) and their components might be correlated to one another.
Which of the following would be an example of variables correlated to one another?
1: Heights and weights of members of a family
2: The depth of the Atlantic Ocean and what you eat for breakfast
3: Today’s weather and a butterfly’s wing position
Selection: 1
You are amazing!
|========================= | 23%
As data scientists, we’d like to find a smaller set of multivariate variables that are uncorrelated AND explain as
much variance (or variability) of the data as possible. This is a statistical approach.
In other words, we’d like to find the best matrix created with fewer variables (that is, a lower rank matrix) that
explains the original data. This is related to data compression.
Two related solutions to these problems are PCA which stands for Principal Component Analysis and SVD, Singular Value
Decomposition. This latter simply means that we express a matrix X of observations (rows) and variables (columns) as
the product of 3 other matrices, i.e., X=UDV^t. This last term (V^t) represents the transpose of the matrix V.
Here U and V each have orthogonal (uncorrelated) columns. U’s columns are the left singular vectors of X and V’s
columns are the right singular vectors of X. D is a diagonal matrix, by which we mean that all of its entries not on
the diagonal are 0. The diagonal entries of D are the singular values of X.
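Base R’s svd function returns exactly these three pieces (components d, u, and v), and multiplying them back together recovers the original matrix. A quick sanity check, using a small matrix of our own rather than the lesson’s mat:

```r
m <- matrix(1:6, nrow = 2)              # any small matrix will do
s <- svd(m)                             # list with components d, u, v
recon <- s$u %*% diag(s$d) %*% t(s$v)   # X = U D V^t
max(abs(recon - m))                     # essentially zero (floating-point noise)
```

The columns of u (and of v) are orthonormal, which is what “orthogonal (uncorrelated) columns” means concretely: crossprod(s$u) is the identity matrix up to rounding.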
To illustrate this idea we created a simple example matrix called mat. Look at it now.
Now we’ll talk a little about PCA, Principal Component Analysis, “a simple, non-parametric method for extracting relevant information from confusing data sets.” We’re quoting here from a very nice concise paper on this subject by Jonathon Shlens of Google Research, A Tutorial on Principal Component Analysis, which can be found at http://arxiv.org/pdf/1404.1100.pdf.
…
|===================================== | 34%
Basically, PCA is a method to reduce a high-dimensional data set to its essential elements (without losing information) and
explain the variability in the data. We won’t go into the mathematical details here, (R has a function to perform
PCA), but you should know that SVD and PCA are closely related.
…
|====================================== | 35%
We’ll demonstrate this now. First we have to scale mat, our simple example data matrix. This means that we subtract
the column mean from every element and divide the result by the column standard deviation. Of course R has a command,
scale, that does this for you. Run svd on scale of mat.
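The PCA–SVD connection can be checked directly: the right singular vectors of the scaled matrix are the principal component loadings that R’s prcomp returns (up to an arbitrary sign flip per column). A sketch, with a small random matrix standing in for the lesson’s mat:

```r
set.seed(42)
mat <- matrix(rnorm(20), nrow = 5)   # stand-in for the lesson's mat
s <- svd(scale(mat))                 # SVD of the column-standardized matrix
p <- prcomp(mat, scale. = TRUE)      # PCA with the same standardization
# loadings agree column by column, up to sign
max(abs(abs(s$v) - abs(p$rotation)))
```

Internally prcomp performs exactly this SVD, which is why the two agree.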
To prove we’re not making this up, we’ve run svd on dataMatrix and stored the result in the object svd1. This has 3
components, d, u and v. Look at the first column of v now. It can be viewed by using the svd1$v[,1] notation.
Here we again show the clustered data matrix on the left. Next to it we’ve plotted the first column of the U matrix
associated with the scaled data matrix. This is the first LEFT singular vector and it’s associated with the ROW means
of the clustered data. You can see the clear separation between the top 24 (around -0.2) row means and the bottom 16
(around 0.2). We don’t show them but note that the other columns of U don’t show this pattern so clearly.
The rightmost display shows the first column of the V matrix associated with the scaled and clustered data matrix.
This is the first RIGHT singular vector and it’s associated with the COLUMN means of the clustered data. You can see
the clear separation between the left 5 column means (between -0.1 and 0.1) and the right 5 column means (all below
-0.4). As with the left singular vectors, the other columns of V don’t show this pattern as clearly as this first one
does.
So the singular value decomposition automatically picked up these patterns, the differences in the row and column
means.
Why were the first columns of both the U and V matrices so special? Well as it happens, the D matrix of the SVD
explains this phenomenon. It is an aspect of SVD called variance explained. Recall that D is the diagonal matrix
sandwiched in between U and V^t in the SVD representation of the data matrix. The diagonal entries of D are like
weights for the U and V columns accounting for the variation in the data. They’re given in decreasing order from
highest to lowest. Look at these diagonal entries now. Recall that they’re stored in svd1$d.
To see this more closely, look at the first 2 columns of the v component. We stored the SVD output in the svd object
svd2.
So the first element which showed the difference between the left and right halves of the matrix accounts for roughly
50% of the variation in the matrix, and the second element which picked up the alternating pattern accounts for 18%
of the variance. The remaining elements account for smaller percentages of the variation. This indicates that the
first pattern is much stronger than the second. Also the two patterns confound each other so they’re harder to
separate and see clearly. This is what often happens with real data.
Now you’re probably convinced that SVD and PCA are pretty cool and useful as tools for analysis, but one problem with
them that you should be aware of, is that they cannot deal with MISSING data. Neither of them will work if any data
in the matrix is missing. (You’ll get error messages from R in red if you try.) Missing data is not unusual, so
luckily we have ways to work around this problem. One we’ll just mention is called imputing the data.
This uses the k nearest neighbors to calculate a value to use in place of the missing data. You may want to specify an integer k which indicates how many neighbors you want to average to create this replacement value. Bioconductor (http://bioconductor.org) has an impute package which you can use to fill in missing data. One specific function in it is impute.knn.
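You don’t need the impute package to see the failure itself. Here’s a minimal base-R illustration of the problem, with a crude column-mean fill-in standing in for the smarter k-nearest-neighbor imputation that impute.knn performs:

```r
set.seed(2)
m <- matrix(rnorm(40), nrow = 10)
m[sample(length(m), 5)] <- NA
# svd(m) would now stop with "infinite or missing values in 'x'"
filled <- m
for (j in seq_len(ncol(filled))) {
  miss <- is.na(filled[, j])
  filled[miss, j] <- mean(filled[, j], na.rm = TRUE)  # crude imputation
}
s <- svd(filled)   # succeeds once the NAs are gone
```

Column-mean imputation is only a placeholder; averaging the k nearest rows, as impute.knn does, usually preserves the structure of the data much better.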
We’ll move on now to a final example of the power of singular value decomposition and principal component analysis
and how they work as a data compression technique.
Consider this low resolution image file showing a face. We’ll use SVD and see how the first several components
contain most of the information in the file so that storing a huge matrix might not be necessary.
The image data is stored in the matrix faceData. Run the R command dim on faceData to see how big it is.
dim(faceData)
[1] 32 32
You are amazing!
|================================================================================== | 75%
So it’s not that big of a file but we want to show you how to use what you learned in this lesson. We’ve done the SVD
and stored it in the object svd1 for you. Here’s the plot of the variance explained.
Recall that the data matrix X is the product of 3 matrices, that is X=UDV^t. These are precisely what you get when
you run svd on the matrix X.
…
|======================================================================================= | 80%
Suppose we create the product of pieces of these, say the first columns of U and V and the first element of D. The
first column of U can be interpreted as a 32 by 1 matrix (recall that faceData was a 32 by 32 matrix), so we can
multiply it by the first element of D, a 1 by 1 matrix, and get a 32 by 1 matrix result. We can multiply that by the
transpose of the first column of V, which is the first principal component. (We have to use the transpose of V’s
column to make it a 1 by 32 matrix in order to do the matrix multiplication properly.)
…
|========================================================================================= | 81%
Alas, that is how we do it in theory, but in R using only one element of d means it’s a constant. So we have to do
the matrix multiplication with the %*% operator and the multiplication by the constant (svd1$d[1]) with the regular
multiplication operator *.
…
|========================================================================================== | 82%
Try this now and put the result in the variable a1. Recall that svd1$u, svd1$d, and svd1$v contain all the information you need. NOTE that because of the peculiarities of R’s casting, if you do the scalar multiplication with the * operator first (before the matrix multiplication with the %*% operator) you MUST enclose the 2 arguments (svd1$u[,1] and svd1$d[1]) in parentheses.
a1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]
Now to look at it as an image. We wrote a function for you called myImage which takes a single argument, a matrix of
data to display using the R function image. Run it now with a1 as its argument.
myImage(a1)
It might not look like much but it’s a good start. Now we’ll try the same experiment but this time we’ll use 2
elements from each of the 3 SVD terms.
Create the matrix a2 as the product of the first 2 columns of svd1$u, a diagonal matrix using the first 2 elements of svd1$d, and the transpose of the first 2 columns of svd1$v. Since all of your multiplicands are matrices you have to
use only the operator %*% AND you DON’T need parentheses. Also, you must use the R function diag with svd1$d[1:2] as
its sole argument to create the proper diagonal matrix. Remember, matrix multiplication is NOT commutative so you
have to put the multiplicands in the correct order. Please use the 1:2 notation and not the c(m:n), i.e., the
concatenate function, when specifying the columns.
a2 <- svd1$u[,1:2] %*% diag(svd1$d[1:2]) %*% t(svd1$v[,1:2])
myImage(a2)
We’re starting to see slightly more detail, and maybe if you squint you see a grimacing mouth. Now let’s see what
image results using 5 components. From our plot of the variance explained 5 components covered a sizeable percentage
of the variation. To save typing, use the up arrow to recall the command which created a2 and replace the a2 and
assignment arrow with the call to myImage, and change the three occurrences of 2 to 5.
myImage(svd1$u[,1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[,1:5]))
Certainly much better. Clearly a face is appearing with eyes, nose, ears, and mouth recognizable. Again, use the up
arrow to recall the last command (calling myImage with a matrix product argument) and change the 5’s to 10’s. We’ll
see how this image looks.
myImage(svd1$u[,1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[,1:10]))
Now that’s pretty close to the original which was low resolution to begin with, but you can see that 10 components really do capture the essence of the image. Singular value decomposition is a good way to approximate data without having to store a lot.
We’ll close now with a few comments. First, when reducing dimensions you have to pay attention to the scales on which
different variables are measured and make sure that all your data is in consistent units. In other words, scales of
your data matter. Second, principal components and singular values may mix real patterns, as we saw in our simple
2-pattern example, so finding and separating out the real patterns require some detective work. Let’s do a quick
review now.
A matrix X has the singular value decomposition UDV^t. The principal components of X are ?
CLUSTERING
In this lesson we’ll apply some of the analytic techniques we learned in this course to data
from the University of California, Irvine. Specifically, the data we’ll use is from UCI’s Center
for Machine Learning and Intelligent Systems. You can find out more about the data at
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones. As this
address indicates, the data involves smartphones and recognizing human activity. Cool, right?
Our goal is to show you how to use exploratory data analysis to point you in fruitful directions
of research, that is, towards answerable questions. Exploratory data analysis is a “rough cut”
or filter which helps you to find the most beneficial areas of questioning so you can set your
priorities accordingly.
We’ve loaded data from this study for you in a matrix called ssd. Run the R command dim now to
see its dimensions.
dim(ssd)
[1] 7352  563
You got it right!
|======= | 8%
Wow - ssd is pretty big, 7352 observations, each of 563 variables. Don’t worry we’ll only use a
small portion of this “Human Activity Recognition database”.
These last 2 columns contain subject and activity information. We saw above that the gathered
data had “been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.” Run the R command table with ssd$subject as
its argument to see if the data in ssd contains training or test data.
CASE STUDY 2
In this lesson we’ll apply some of the techniques we learned in this course to study air
pollution data, specifically particulate matter (we’ll call it pm25 sometimes), collected by the
U.S. Environmental Protection Agency. This website
https://www.health.ny.gov/environmental/indoors/air/pmq_a.htm from New York State offers some
basic information on this topic if you’re interested.
…
|== | 2%
Particulate matter (less than 2.5 microns in diameter) is a fancy name for dust, and breathing
in dust might pose health hazards to the population. We’ll study data from two years, 1999 (when
monitoring of particulate matter started) and 2012. Our goal is to see if there’s been a
noticeable decline in this type of air pollution between these two years.
…
|=== | 3%
We’ve read in 2 large zipped files for you using the R command read.table (which is smart enough
to unzip the files). We stored the 1999 data in the array pm0 for you. Run the R command dim
now to see its dimensions.
dim(pm0)
[1] 117421      5
Keep working like that and you’ll get there!
|==== | 4%
We see that pm0 has over 117000 lines, each containing 5 columns. In the original file, at the
EPA website, each row had 28 columns, but since we’ll be using only a few of these, we’ve
created and read in a somewhat smaller file. Run head on pm0 now to see what the first few lines
look like.
head(pm0)
  V1 V2 V3       V4     V5
1  1 27  1 19990103     NA
2  1 27  1 19990106     NA
3  1 27  1 19990109     NA
4  1 27  1 19990112  8.841
5  1 27  1 19990115 14.920
6  1 27  1 19990118  3.878
That’s the answer I was looking for.
|===== | 5%
We see there’s some missing data, but we won’t worry about that now. We also see that the column
names, V1, V2, etc., are not informative. However, we know that the first line of the original
file (a comment) explained what information the columns contained.
…
|====== | 6%
We created the variable cnames containing the 28 column names of the original file. Take a look
at the column names now.
cnames$names
Error in cnames$names : $ operator is invalid for atomic vectors
cnames
[1] "# RD|Action Code|State Code|County Code|Site ID|Parameter|POC|Sample Duration|Unit|Method|Date|Start Time|Sample Value|Null Data Code|Sampling Frequency|Monitor Protocol (MP) ID|Qualifier - 1|Qualifier - 2|Qualifier - 3|Qualifier - 4|Qualifier - 5|Qualifier - 6|Qualifier - 7|Qualifier - 8|Qualifier - 9|Qualifier - 10|Alternate Method Detectable Limit|Uncertainty"
That’s a job well done!
|====== | 7%
We see that the 28 column names look all jumbled together even though they’re separated by “|”
characters, so let’s fix this. Reassign to cnames the output of a call to strsplit (string
split) with 3 arguments. The first is cnames, the pipe symbol ‘|’ is the second (use the
quotation marks), and the third is the argument fixed set to TRUE. Try this now.
cnames <- strsplit(cnames, "|", fixed = TRUE)
You are amazing!
|======= | 8%
The variable cnames now holds a list of the column headings. Take another look at the column
names.
cnames
[[1]]
 [1] "# RD"                              "Action Code"
 [3] "State Code"                        "County Code"
 [5] "Site ID"                           "Parameter"
 [7] "POC"                               "Sample Duration"
 [9] "Unit"                              "Method"
[11] "Date"                              "Start Time"
[13] "Sample Value"                      "Null Data Code"
[15] "Sampling Frequency"                "Monitor Protocol (MP) ID"
[17] "Qualifier - 1"                     "Qualifier - 2"
[19] "Qualifier - 3"                     "Qualifier - 4"
[21] "Qualifier - 5"                     "Qualifier - 6"
[23] "Qualifier - 7"                     "Qualifier - 8"
[25] "Qualifier - 9"                     "Qualifier - 10"
[27] "Alternate Method Detectable Limit" "Uncertainty"
Great job!
|======== | 9%
Nice, but we don’t need all these. Assign to names(pm0) the output of a call to the function
make.names with cnames[[1]][wcol] as the argument. The variable wcol holds the indices of the 5
columns we selected (from the 28) to use in this lesson, so those are the column names we’ll
need. As the name suggests, the function “makes syntactically valid names”.
names(pm0) <- make.names(cnames[[1]][wcol])
All that hard work is paying off!
|========= | 10%
Now re-run head on pm0 to see if the column names have been put in place.
head(pm0)
  State.Code County.Code Site.ID     Date Sample.Value
1          1          27       1 19990103           NA
2          1          27       1 19990106           NA
3          1          27       1 19990109           NA
4          1          27       1 19990112        8.841
5          1          27       1 19990115       14.920
6          1          27       1 19990118        3.878
Excellent work!
|========== | 11%
Now it’s clearer what information each column of pm0 holds. The measurements of particulate
matter (pm25) are in the column named Sample.Value. Assign this component of pm0 to the variable
x0. Use the m$n notation.
pm0$x0 <- pm0$Sample.Value
You almost had it, but not quite. Try again. Or, type info() for more options.
Type x0 <- pm0$Sample.Value at the command prompt.
px0 <- pm0$Sample.Value
One more time. You can do it! Or, type info() for more options.
Type x0 <- pm0$Sample.Value at the command prompt.
x0 <- pm0$Sample.Value
All that practice is paying off!
|=========== | 12%
Call the R command str with x0 as its argument to see x0’s structure.
str(x0)
 num [1:117421] NA NA NA 8.84 14.92 ...
You got it right!
|============ | 13%
We see that x0 is a numeric vector (of length 117000+) with at least the first 3 values missing.
Exactly what percentage of values are missing in this vector? Use the R function mean with
is.na(x0) as an argument to see what percentage of values are missing (NA) in x0.
mean(is.na(x0))
[1] 0.1125608
That’s correct!
|============= | 14%
So a little over 11% of the 117000+ are missing. We’ll keep that in mind. Now let’s start
processing the 2012 data which we stored for you in the array pm1.
…
|============== | 15%
We’ll repeat what we did for pm0, except a little more efficiently. First assign the output of
make.names(cnames[[1]][wcol]) to names(pm1).
names(pm1) <- make.names(cnames[[1]][wcol])
Great job!
|=============== | 16%
Find the dimensions of pm1 with the command dim.
dim(pm1)
[1] 1304287       5
Perseverance, that’s the answer.
|================ | 18%
Wow! Over 1.3 million entries. Particulate matter was first collected in 1999 so perhaps there
weren’t as many sensors collecting data then as in 2012 when the program was more mature. If you
ran head on pm1 you’d see that it looks just like pm0. We’ll move on though.
…
|================= | 19%
Create the variable x1 by assigning to it the Sample.Value component of pm1.
x1 <- pm1$Sample.Value
Nice work!
|================= | 20%
Now let’s see what percentage of values are missing in x1. As before, use the R function mean
with is.na(x1) as an argument to find out.
mean(is.na(x1))
[1] 0.05607125
That’s a job well done!
|================== | 21%
So only 5.6% of the particulate matter measurements are missing. That’s about half the
percentage as in 1999.
…
|=================== | 22%
Now let’s look at summaries (using the summary command) for both datasets. First, x0.
summary(x0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    7.20   11.50   13.74   17.90  157.10   13217
Nice work!
|==================== | 23%
The numbers in the vectors x0 and x1 represent measurements taken in micrograms per cubic meter.
Now look at the summary of x1.
summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -10.00    4.00    7.63    9.14   12.00  909.00   73133
You’re the best!
|===================== | 24%
We see that both the median and the mean of measured particulate matter have declined from 1999
to 2012. In fact, all of the measurements, except for the maximum and missing values (Max and
NA’s), have decreased. Even the Min has gone down from 0 to -10.00! We’ll address what a
negative measurement might mean a little later. Note that the Max has increased from 157 in 1999
to 909 in 2012. This is quite high and might reflect an error in the table or malfunctions in
some monitors.
…
|====================== | 25%
Call the boxplot function with 2 arguments, x0 and x1.
Huh? Did somebody step on the boxes? It’s hard to see what’s going on here. There are so many
values outside the boxes and the range of x1 is so big that the boxes are flattened. It might be
more informative to call boxplot on the logs (base 10) of x0 and x1. Do this now using log10(x0)
and log10(x1) as the 2 arguments.
boxplot(log10(x0),log10(x1))
Warning messages:
1: In boxplot.default(log10(x0), log10(x1)) : NaNs produced
2: In bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==  :
  Outlier (-Inf) in boxplot 1 is not drawn
3: In bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==  :
  Outlier (-Inf) in boxplot 2 is not drawn
That’s the answer I was looking for.
|======================== | 27%
A bonus! Not only do we get a better looking boxplot we also get some warnings from R in Red.
These let us know that some values in x0 and x1 were “unloggable”, no doubt the 0 (Min) we saw
in the summary of x0 and the negative values we saw in the Min of the summary of x1.
Let’s return to the question of the negative values in x1. Let’s count how many negative values
there are. We’ll do this in a few steps.
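The pattern we are about to use can be sketched on a small made-up vector: comparisons involving NA produce NA, which is why sum and mean will need na.rm = TRUE.

```r
# Toy sketch: counting negatives in a vector that contains NAs.
# Comparing NA with 0 yields NA, so sum and mean need na.rm = TRUE.
v <- c(-2, 5, NA, -1, 4, NA, 3)  # made-up measurements
neg <- v < 0                     # logical vector; NAs stay NA
n_neg <- sum(neg, na.rm = TRUE)  # how many negatives: 2
p_neg <- mean(neg, na.rm = TRUE) # fraction of non-missing values: 0.4
```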
…
|=========================== | 30%
First, form the vector negative by assigning to it the boolean x1<0.
negative <- x1<0
You’re the best!
|============================ | 31%
Now run the R command sum with 2 arguments. The first is negative, and the second is na.rm set
equal to TRUE. This tells sum to ignore the missing values in negative.
sum(negative,na.rm = TRUE)
[1] 26474
That’s the answer I was looking for.
|============================ | 32%
So there are over 26000 negative values. Sounds like a lot. Is it? Run the R command mean with
the same 2 arguments you just used in the call to sum. This will tell us a percentage.
mean(negative,na.rm = TRUE)
[1] 0.0215034
Excellent job!
|============================= | 33%
We see that just 2% of the x1 values are negative. Perhaps that’s a small enough percentage that
we can ignore them. Before we ignore them, though, let’s see if they occur during certain times
of the year.
…
|============================== | 34%
First create the array dates by assigning to it the Date component of pm1. Remember to use the
x$y notation.
dates <- pm1$Date
All that hard work is paying off!
|=============================== | 35%
To see what dates looks like run the R command str on it.
str(dates)
 int [1:1304287] 20120101 20120104 20120107 20120110 20120113 20120116 20120119 20120122 20120125 20120128 ...
Nice work!
|================================ | 36%
We see dates is a very long vector of integers. However, the format of the entries is hard to
read. There’s no separation between the year, month, and day. Reassign to dates the output of a
call to as.Date with the 2 arguments as.character(dates) as the first argument and the string
“%Y%m%d” as the second.
dates <- as.Date(as.character())
One more time. You can do it! Or, type info() for more options.
Type dates <- as.Date(as.character(dates), "%Y%m%d") at the command prompt.
dates <- as.Date(as.character(dates), "%Y%m%d")
You got it right!
|================================= | 37%
Now when you run head on dates you’ll see the dates in a nicer format. Try this now.
head(dates)
[1] "2012-01-01" "2012-01-04" "2012-01-07" "2012-01-10" "2012-01-13" "2012-01-16"
That’s the answer I was looking for.
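The conversion pattern, sketched on a couple of made-up integer dates:

```r
# Sketch of the conversion pattern on made-up values: integer dates
# like 20120101 must first become character strings, then Date
# objects via the format string "%Y%m%d".
d_int <- c(20120101L, 20120104L)
d <- as.Date(as.character(d_int), "%Y%m%d")
format(d)  # "2012-01-01" "2012-01-04"
```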
|================================== | 38%
Let’s plot a histogram of the months when the particulate matter measurements are negative. Run
hist with 2 arguments. The first is dates[negative] and the second is the string "month".
hist(dates[negative],"month")
We see the bulk of the negative measurements were taken in the winter months, with a spike in
May. Not many of these negative measurements occurred in summer months. We can take a guess that
because particulate measures tend to be low in winter and high in summer, coupled with the fact
that higher densities are easier to measure, that measurement errors occurred when the values
were low. For now we’ll attribute these negative measurements to errors. Also, since they
account for only 2% of the 2012 data, we’ll ignore them.
…
|==================================== | 40%
Now we’ll change focus a bit and instead of looking at all the monitors throughout the country
and the data they recorded, we’ll try to find one monitor that was taking measurements in both
1999 and 2012. This will allow us to control for different geographical and environmental
variables that might have affected air quality in different areas. We’ll narrow our search and
look just at monitors in New York State.
…
|===================================== | 41%
We subsetted off the New York State monitor identification data for 1999 and 2012 into 2
vectors, site0 and site1. Look at the structure of site0 now with the R command str.
str(site0)
 chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" "13.11" "27.1004" ...
That’s the answer I was looking for.
|====================================== | 42%
We see that site0 (the IDs of monitors in New York State in 1999) is a vector of 33 strings,
each of which has the form “x.y”. We’ve created these from the county codes (the x portion of
the string) and the monitor IDs (the y portion). If you ran str on site1 you’d see 18 similar
values.
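Although the course built site0 and site1 for us, such "x.y" IDs can be assembled with paste; a hypothetical sketch on made-up codes:

```r
# Hypothetical sketch of how such "x.y" IDs can be built: paste the
# county codes and monitor IDs together with a "." separator.
county <- c(1, 5, 27)      # made-up county codes
site   <- c(5, 80, 1004)   # made-up monitor IDs
ids <- paste(county, site, sep = ".")
ids  # "1.5" "5.80" "27.1004"
```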
…
|======================================= | 43%
Use the intersect command with site0 and site1 as arguments and put the result in the variable
both.
intersect(site0,site1)
 [1] "1.5"     "1.12"    "5.80"    "13.11"   "29.5"    "31.3"    "63.2008" "67.1015" "85.55"
[10] "101.3"
Not quite, but you’re learning! Try again. Or, type info() for more options.
Type both <- intersect(site0, site1) at the command prompt.
both <- intersect(site0,site1)
Nice work!
|======================================= | 44%
Take a look at both now.
both
 [1] "1.5"     "1.12"    "5.80"    "13.11"   "29.5"    "31.3"    "63.2008" "67.1015" "85.55"
[10] "101.3"
All that hard work is paying off!
|======================================== | 45%
We see that 10 monitors in New York State were active in both 1999 and 2012.
…
|========================================= | 46%
To save you some time and typing, we modified the data frames pm0 and pm1 slightly by adding to
each of them a new component, county.site. This is just a concatenation of two original
components County.Code and Site.ID. We did this to facilitate the next step which is to find out
how many measurements were taken by the 10 New York monitors working in both of the years of
interest. Run head on pm0 to see the first few entries now.
head(pm0)
  State.Code County.Code Site.ID     Date Sample.Value     x0 county.site
1          1          27       1 19990103           NA     NA        27.1
2          1          27       1 19990106           NA     NA        27.1
3          1          27       1 19990109           NA     NA        27.1
4          1          27       1 19990112        8.841  8.841        27.1
5          1          27       1 19990115       14.920 14.920        27.1
6          1          27       1 19990118        3.878  3.878        27.1
All that hard work is paying off!
|========================================== | 47%
Now pm0 and pm1 each have an extra column, and that last column, county.site, is a concatenation
of two other columns, County.Code and Site.ID.
…
|=========================================== | 48%
Now let’s see how many measurements each of the 10 New York monitors that were active in both
1999 and 2012 took in those years. We’ll create 2 subsets (one for each year), one of pm0 and
the other of pm1.
…
|============================================ | 49%
The subsets will filter for 2 characteristics. The first is State.Code equal to 36 (the code for
New York), and the second is that the county.site (the component we added) is in the vector
both.
…
|============================================= | 51%
First create the variable cnt0 by assigning to it the output of the R command subset, called
with 2 arguments. The first is pm0, and the second is a boolean with the 2 conditions we just
mentioned. Recall that the testing for equality in a boolean requires ==, intersection of 2
boolean conditions is denoted by & and membership by %in%.
cnt0 <- subset(pm0,State.code == 36 & Country.site %in% both)
Error in eval(expr, envir, enclos) : object 'State.code' not found
cnt0 <- subset(pm0,State.code==36&Country.site %in% both)
Error in eval(expr, envir, enclos) : object 'State.code' not found
cnt0 <- subset(pm0,State.Code==36&Country.site %in% both)
Error in match(x, table, nomatch = 0L) : object 'Country.site' not found
cnt0 <- subset(pm0,State.Code==36&County.site %in% both)
Error in match(x, table, nomatch = 0L) : object 'County.site' not found
cnt0 <- subset(pm0,State.Code==36&county.site %in% both)
Excellent job!
|============================================== | 52%
Recall the last command with the up arrow, and create cnt1 (instead of cnt0). Remember to change
pm0 to pm1. Everything else can stay the same.
cnt1 <- subset(pm1,State.Code==36&county.site %in% both)
Excellent job!
|=============================================== | 53%
Now run the command sapply(split(cnt0, cnt0$county.site), nrow). This will split cnt0 into
several data frames according to county.site (that is, monitor IDs) and tell us how many
measurements each monitor recorded.
dim(cnt0)
[1] 952   7
Not quite, but you’re learning! Try again. Or, type info() for more options.
Type sapply(split(cnt0, cnt0$county.site), nrow) at the command prompt.
dim(cnt1)
[1] 328   6
Not quite, but you’re learning! Try again. Or, type info() for more options.
Type sapply(split(cnt0, cnt0$county.site), nrow) at the command prompt.
sapply(split(cnt0, cnt0$county.site), nrow)
   1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015   85.55
     61     122     152      61      61     183      61     122     122       7
That’s correct!
|================================================ | 54%
Do the same for cnt1. (Recall your last command and change 2 occurrences of cnt0 to cnt1.)
sapply(split(cnt1, cnt1$county.site), nrow)
   1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015   85.55
     31      64      31      31      33      15      31      30      31      31
Great job!
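The split/sapply idiom generalizes beyond this lesson; a toy sketch on a made-up data frame:

```r
# Toy sketch of the split/sapply idiom: split a data frame by a
# grouping column, then apply nrow to count measurements per group.
df <- data.frame(id = c("a", "a", "b", "a", "b"), val = 1:5)
counts <- sapply(split(df, df$id), nrow)
counts  # a = 3, b = 2; table(df$id) gives the same tally here
```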
|================================================= | 55%
From the output of the 2 calls to sapply, which monitor is the only one whose number of
measurements increased from 1999 to 2012?
1: 85.55
2: 101.3
3: 63.2008
4: 29.5
Selection: 1
You got it right!
|================================================== | 56%
We want to examine a monitor with a reasonable number of measurements so let’s look at the
monitor with ID 63.2008. Create a variable pm0sub which is the subset of cnt0 (this contains
just New York data) which has County.Code equal to 63 and Site.ID 2008.
pm0sub <- subset(cnt0,County.Code==63&Site.ID==2008)
Keep working like that and you’ll get there!
|================================================== | 57%
Now do the same for cnt1. Name this new variable pm1sub.
pm1sub <- subset(cnt1,County.Code==63&Site.ID==2008)
Excellent work!
|=================================================== | 58%
From the output of the 2 calls to sapply, how many rows will pm0sub have?
1: 30
2: 29
3: 183
4: 122
Selection: 4
You are amazing!
|==================================================== | 59%
Now we’d like to compare the pm25 measurements of this particular monitor (63.2008) for the 2
years. First, create the vector x0sub by assigning to it the Sample.Value component of pm0sub.
x0sub <- pm0sub$Sample.Value
Your dedication is inspiring!
|===================================================== | 60%
Similarly, create x1sub from pm1sub.
x1sub <- pm1sub$Sample.Value
You are quite good my friend!
|====================================================== | 61%
We’d like to make our comparison visually so we’ll have to create a time series of these pm25
measurements. First, create a dates0 variable by assigning to it the output of a call to
as.Date. This will take 2 arguments. The first is a call to as.character with pm0sub$Date as the
argument. The second is the format string “%Y%m%d”.
dates0 <- as.Date(as.character(pm0sub$Date),"Y%m%d")
You almost had it, but not quite. Try again. Or, type info() for more options.
Type dates0 <- as.Date(as.character(pm0sub$Date),"%Y%m%d") at the command prompt.
dates0 <- as.Date(as.character(pm0sub$Date),"%Y%m%d")
Excellent job!
|======================================================= | 62%
Do the same for the 2012 data. Specifically, create dates1 using pm1sub$Date as your input.
dates1 <- as.Date(as.character(pm1sub$Date),"%Y%m%d")
You are amazing!
|======================================================== | 63%
Now we’ll plot these 2 time series in the same panel using the base plotting system. Call par
with 2 arguments. The first is mfrow set equal to c(1,2). This will tell the system we’re
plotting 2 graphs in 1 row and 2 columns. The second argument will adjust the panel’s margins.
It is mar set to c(4,4,2,1).
par(c(1,2),c(4,4,2,1))
[[1]]
NULL

[[2]]
NULL
You’re close…I can feel it! Try it again. Or, type info() for more options.
Type par(mfrow = c(1, 2), mar = c(4, 4, 2, 1)) at the command prompt.
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
That’s a job well done!
|========================================================= | 64%
Call plot with the 3 arguments dates0, x0sub, and pch set to 20. The first two arguments are the
x and y coordinates. This will show the pm25 values as functions of time.
plot(dates0,x0sub,pch=20)
Now we’ll mark the median.
…
|=========================================================== | 66%
Use abline to add a horizontal line at the median of the pm25 values. Make the line width 2 (lwd
is the argument), and when you call median with x0sub, specify the argument na.rm to be TRUE.
abline(lwd=2,h = median(x0sub))
Not quite, but you’re learning! Try again. Or, type info() for more options.
Type abline(h = median(x0sub, na.rm = TRUE),lwd=2) at the command prompt.
abline(h = median(x0sub, na.rm = TRUE),lwd=2)
You are quite good my friend!
|============================================================ | 67%
Now we’ll do the same for the 2012 data. Call plot with the 3 arguments dates1, x1sub, and pch
set to 20.
plot(dates1,x1sub,pch=20)
You got it right!
|============================================================= | 68%
As before, we’ll mark the median of this 2012 data.
…
|============================================================= | 69%
Use abline to add a horizontal line at the median of the pm25 values. Make the line width 2 (lwd
is the argument). Remember to specify the argument na.rm to be TRUE when you call median on
x1sub.
abline(h = median(x1sub, na.rm = TRUE),lwd=2)
Nice work!
|============================================================== | 70%
Which median is larger - the one for 1999 or the one for 2012?
1: 2012
2: 1999
Selection: 2
Excellent work!
|=============================================================== | 71%
The picture makes it look like the median is higher for 2012 than 1999. Closer inspection shows
that this isn’t true. The median for 1999 is a little over 10 micrograms per cubic meter and for
2012 it’s a little over 8. The plots appear this way because the 1999 plot ….
1: shows a bigger range of y values than the 2012 plot
2: shows different months than those in the 2012 plot
3: displays more points than the 2012 plot
Selection: 1
You got it right!
|================================================================ | 72%
The 1999 plot shows a much bigger range of pm25 values on the y axis, from below 10 to 40, while
the 2012 pm25 values are much more restricted, from around 1 to 14. We should really plot the
points of both datasets on the same range of values on the y axis. Create the variable rng by
assigning to it the output of a call to the R command range with 3 arguments, x0sub, x1sub, and
the boolean na.rm set to TRUE.
rng <- range(x0sub,x1sub,na.rm = TRUE)
You nailed it! Good job!
|================================================================= | 73%
Look at rng to see the values it spans.
rng
[1]  3.0 40.1
Excellent work!
|================================================================== | 74%
Here’s a new figure we’ve created showing the two plots side by side with the same range of values
on the y axis. We used the argument ylim set equal to rng in our 2 calls to plot. The
improvement in the medians between 1999 and 2012 is now clear. Also notice that in 2012 there
are no big values (above 15). This shows that not only is there a chronic improvement in air
quality, but also there are fewer days with severe pollution.
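The shared-axis trick can be sketched on made-up vectors; rng from range() spans both datasets and feeds the ylim argument of each call to plot:

```r
# Sketch on made-up vectors: range() over both datasets gives one
# common y-range, which ylim applies to each panel so the two
# plots share a comparable axis.
a <- c(3, 12, 40.1)              # hypothetical 1999-style values
b <- c(1, 8, 14)                 # hypothetical 2012-style values
rng <- range(a, b, na.rm = TRUE) # spans both vectors: 1 to 40.1
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(a, ylim = rng, pch = 20)
plot(b, ylim = rng, pch = 20)
```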
…
|=================================================================== | 75%
The last avenue of this data we’ll explore (and we’ll do it quickly) concerns a comparison of
all the states’ mean pollution levels. This is important because the states are responsible for
implementing the regulations set at the federal level by the EPA.
…
|==================================================================== | 76%
Let’s first gather the mean (average measurement) for each state in 1999. Recall that the
original data for this year was stored in pm0.
…
|===================================================================== | 77%
Create the vector mn0 with a call to the R command with using 2 arguments. The first is pm0.
This is the data in which the second argument, an expression, will be evaluated. The second
argument is a call to the function tapply. This call requires 4 arguments. Sample.Value and
State.Code are the first two. We want to apply the function mean to Sample.Value, so mean is the
third argument. The fourth is simply the boolean na.rm set to TRUE.
mn0 <- with(pm0,tapply(Sample.Value,State.Code,mean,na.rm = TRUE))
Nice work!
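The with/tapply combination computes a grouped summary; a toy sketch on a made-up data frame:

```r
# Toy sketch of with() + tapply(): evaluate an expression inside a
# data frame and compute a grouped mean with NAs removed.
df <- data.frame(state = c("1", "1", "2", "2"),
                 value = c(10, NA, 4, 6))  # made-up measurements
means <- with(df, tapply(value, state, mean, na.rm = TRUE))
means  # state "1" -> 10, state "2" -> 5
```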
|====================================================================== | 78%
Call the function str with mn0 as its argument to see what it looks like.
str(mn0)
 num [1:53(1d)] 19.96 6.67 10.8 15.68 17.66 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:53] "1" "2" "4" "5" ...
All that hard work is paying off!
|======================================================================= | 79%
We see mn0 is a 53 long numerical vector. Why 53 if there are only 50 states? As it happens,
pm25 measurements for the District of Columbia (Washington D.C), the Virgin Islands, and Puerto
Rico are included in this data. They are coded as 11, 72, and 78 respectively.
…
|======================================================================== | 80%
Recall your command creating mn0 and change it to create mn1 using pm1 as the first input to the
call to with.
mn1 <- with(pm1,tapply(Sample.Value,State.Code,mean,na.rm = TRUE))
All that hard work is paying off!
|======================================================================== | 81%
For fun, call the function str with mn1 as its argument.
str(mn1)
 num [1:52(1d)] 10.13 4.75 8.61 10.56 9.28 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:52] "1" "2" "4" "5" ...
You got it right!
|========================================================================= | 82%
So mn1 has only 52 entries, rather than 53. We checked. There are no entries for the Virgin
Islands in 2012. Call summary now with mn0 as its input.
summary(mn0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.862   9.519  12.310  12.410  15.640  19.960
That’s correct!
|========================================================================== | 84%
Now call summary with mn1 as its input so we can compare the two years.
summary(mn1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.006   7.355   8.729   8.759  10.610  11.990
You got it right!
|=========================================================================== | 85%
We see that in all 6 entries, the 2012 numbers are less than those in 1999. Now we’ll create 2
new dataframes containing just the state names and their mean measurements for each year. First,
we’ll do this for 1999. Create the data frame d0 by calling the function data.frame with 2
arguments. The first is state set equal to names(mn0), and the second is mean set equal to mn0.
d0 <- data.frame(state=names(mn0),mean=mn0)
Keep up the great work!
|============================================================================ | 86%
Recall the last command and create d1 instead of d0 using the 2012 data. (There’ll be 3 changes
of 0 to 1.)
d1 <- data.frame(state=names(mn1),mean=mn1)
You are quite good my friend!
|============================================================================= | 87%
Create the array mrg by calling the R command merge with 3 arguments, d0, d1, and the argument
by set equal to the string “state”.
mrg <- merge(d0,d1,by="state")
You got it!
|============================================================================== | 88%
Run dim with mrg as its argument to see how big it is.
dim(mrg)
[1] 52  3
You are quite good my friend!
|=============================================================================== | 89%
We see mrg has 52 rows and 3 columns. Since the Virgin Islands data was missing from d1, it is
excluded from mrg. Look at the first few entries of mrg using the head command.
head(mrg)
  state    mean.x    mean.y
1     1 19.956391 10.126190
2    10 14.492895 11.236059
3    11 15.786507 11.991697
4    12 11.137139  8.239690
5    13 19.943240 11.321364
6    15  4.861821  8.749336
You nailed it! Good job!
|================================================================================ | 90%
Each row of mrg has 3 entries - a state identified by number, a state mean for 1999 (mean.x),
and a state mean for 2012 (mean.y).
…
|================================================================================= | 91%
Now we’ll plot the data to see how the state means changed between the 2 years. First we’ll plot
the 1999 data in a single column at x=1. The y values for the points will be the state means.
Again, we’ll use the R command with so we don’t have to keep typing mrg as the data environment
in which to evaluate the second argument, the call to plot. We’ve already reset the graphical
parameters for you.
…
|================================================================================== | 92%
For the first column of points, call with with 2 arguments. The first is mrg, and the second is
the call to plot with 3 arguments. The first of these is rep(1,52). This tells the plot routine
that the x coordinates for all 52 points are 1. The second argument is the second column of mrg
or mrg[,2] which holds the 1999 data. The third argument is the range of x values we want,
namely xlim set to c(.5,2.5). This works since we’ll be plotting 2 columns of points, one at x=1
and the other at x=2.
with(mrg,plot(rep(1,52),mrg[,2],xlim = c(.5,2.5)))
Perseverance, that’s the answer.
|=================================================================================== | 93%
We see a column of points at x=1 which represent the 1999 state means. For the second column of
points, again call with with 2 arguments. As before, the first is mrg. The second, however, is a
call to the function points with 2 arguments. We need to do this since we’re adding points to an
already existing plot. The first argument to points is the set of x values, rep(2,52). The
second argument is the set of y values, mrg[,3]. Of course, this is the third column of mrg. (We
don’t need to specify the range of x values again.)
with(mrg,points(rep(2,52),mrg[,3]))
You are amazing!
|=================================================================================== | 94%
We see a shorter column of points at x=2. Now let’s connect the dots. Use the R function
segments with 4 arguments. The first 2 are the x and y coordinates of the 1999 points and the
last 2 are the x and y coordinates of the 2012 points. As in the previous calls specify the x
coordinates with calls to rep and the y coordinates with references to the appropriate columns
of mrg.
segments(rep(1,52),mrg[,2],rep(2,52),mrg[,3])
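The plot/points/segments sequence produces a simple "slope" chart; a sketch on made-up means (not the real state data):

```r
# Sketch of the two-column "slope" plot on made-up means (not the
# real state data): points at x = 1 and x = 2, joined by segments.
m1 <- c(19.9, 14.5, 4.9)  # hypothetical year-one means
m2 <- c(10.1, 11.2, 8.7)  # hypothetical year-two means
n <- length(m1)
plot(rep(1, n), m1, xlim = c(.5, 2.5), ylim = range(m1, m2))
points(rep(2, n), m2)
segments(rep(1, n), m1, rep(2, n), m2)
```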
For fun, let’s see which states had higher means in 2012 than in 1999. Just use the
mrg[mrg$mean.x < mrg$mean.y, ] notation to find the rows of mrg with this particulate property.
mrg[mrg$mean.x < mrg$mean.y, ]
   state    mean.x    mean.y
6     15  4.861821  8.749336
23    31  9.167770  9.207489
27    35  6.511285  8.089755
33    40 10.657617 10.849870
All that hard work is paying off!
|====================================================================================== | 97%
Only 4 states had worse pollution averages, and 2 of these had means that were very close. If
you want to see which states (15, 31, 35, and 40) these are, you can check out this website
https://www.epa.gov/enviro/state-fips-code-listing to decode the state codes.
…
|======================================================================================= | 98%
This concludes the lesson, comparing air pollution data from two years in different ways. First,
we looked at measures of the entire set of monitors, then we compared the two measures from a
particular monitor, and finally, we looked at the mean measures of the individual states.